Abstract

This paper deals with the trace regression model where $n$ entries or linear combinations
of entries of an unknown $m_1 \times m_2$ matrix $A_0$ corrupted by noise are observed.
We propose a new nuclear norm penalized estimator of $A_0$ and establish a general
sharp oracle inequality for this estimator for arbitrary values of $n, m_1, m_2$ under the
condition of isometry in expectation. Then this method is applied to the matrix
completion problem. In this case, the estimator admits a simple explicit form and
we prove that it satisfies oracle inequalities with faster rates of convergence than
in the previous works. They are valid, in particular, in the high-dimensional setting
$m_1 m_2 \gg n$. We show that the obtained rates are optimal up to logarithmic factors
in a minimax sense and also derive, for any fixed matrix $A_0$, a non-minimax lower
bound on the rate of convergence of our estimator, which coincides with the upper
bound up to a constant factor. Finally, we show that our procedure provides an exact
recovery of the rank of $A_0$ with probability close to 1. We also discuss the statistical
learning setting where there is no underlying model determined by $A_0$ and the aim
is to find the best trace regression model approximating the data.
1 Introduction

Assume that we observe $n$ independent random pairs $(X_i, Y_i)$, $i = 1, \dots, n$, where $X_i$ are
random matrices with dimensions $m_1 \times m_2$ and $Y_i$ are random variables in $\mathbb{R}$, satisfying
the trace regression model:

$$\mathbb{E}(Y_i \mid X_i) = \mathrm{tr}(X_i^\top A_0), \quad i = 1, \dots, n, \tag{1.1}$$

where $A_0 \in \mathbb{R}^{m_1 \times m_2}$ is an unknown matrix, $\mathbb{E}(Y_i \mid X_i)$ is the conditional expectation of $Y_i$
given $X_i$, and $\mathrm{tr}(B)$ denotes the trace of matrix $B$. We consider the problem of estimation
of $A_0$ based on the observations $(X_i, Y_i)$, $i = 1, \dots, n$. Though the results of this paper are
obtained for general $n, m_1, m_2$, the main motivation is in the high-dimensional setting,
which corresponds to $m_1 m_2 \gg n$, with low rank matrices $A_0$.
It will be convenient to write the model (1.1) in the form

$$Y_i = \mathrm{tr}(X_i^\top A_0) + \xi_i, \quad i = 1, \dots, n, \tag{1.2}$$

where the noise variables $\xi_i = Y_i - \mathbb{E}(Y_i \mid X_i)$ are independent and have zero means.
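For any matrices $A, B \in \mathbb{R}^{m_1 \times m_2}$, we define the scalar products

$$\langle A, B \rangle = \mathrm{tr}(A^\top B)$$

and

$$\langle A, B \rangle_{L_2(\Pi)} = \frac{1}{n} \sum_{i=1}^n \mathbb{E}\big( \langle A, X_i \rangle \langle B, X_i \rangle \big).$$

Here $\Pi = \frac{1}{n} \sum_{i=1}^n \Pi_i$, where $\Pi_i$ denotes the distribution of $X_i$. The corresponding norm
$\|A\|_{L_2(\Pi)}$ is given by

$$\|A\|_{L_2(\Pi)}^2 = \frac{1}{n} \sum_{i=1}^n \mathbb{E}\big( \langle A, X_i \rangle^2 \big).$$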

Example 1. Matrix Completion. Assume that the design matrices $X_i$ are i.i.d.
uniformly distributed on the set

$$\mathcal{X} = \big\{ e_j(m_1) e_k^\top(m_2),\ 1 \le j \le m_1,\ 1 \le k \le m_2 \big\}, \tag{1.3}$$

where $e_k(m)$ are the canonical basis vectors in $\mathbb{R}^m$. The set $\mathcal{X}$ forms an orthonormal
basis in the space of $m_1 \times m_2$ matrices that will be called the matrix completion basis.
Let also $n < m_1 m_2$. Then the problem of estimation of $A_0$ coincides with the problem of
matrix completion under uniform sampling at random (USR) as studied in the non-noisy
case ($\xi_i = 0$) in [2T], and in the noisy case in [23] [T3]. Considering low rank matrices
$A_0$ is of a particular interest. Clearly, for such $X_i$ we have the isometry

$$\|A\|_{L_2(\Pi)}^2 = \mu^{-2} \|A\|_2^2 \tag{1.4}$$

for all matrices $A \in \mathbb{R}^{m_1 \times m_2}$, where $\mu = \sqrt{m_1 m_2}$, and $\|A\|_2$ is the Frobenius norm of $A$.
However, the restricted isometry property in the usual sense, i.e., "in probability", cf.,
e.g., [22], does not hold for matrix completion, since for $n < m_1 m_2$ there trivially exists
a matrix of rank 1 in the null space of the sampling operator.

One can also consider more general matrix measurement models in which, for a
given orthonormal basis in the space of matrices, a random sample of Fourier coefficients
of the target matrix $A_0$ is observed subject to a random noise. For more discussion on
matrix completion with other types of sampling, see [5J [TU| [TT| [To"! [T7] and references
therein.
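
The two facts used in Example 1 can be checked by direct enumeration. The sketch below, in plain Python, is only an illustration: the helper names (`sample_positions`, `l2_pi_norm_sq`, `unobserved_rank_one`) and the sizes are assumptions, not part of the paper. It uses that $\langle A, e_j e_k^\top \rangle = a_{jk}$, so a rank-one basis matrix sitting on an unobserved entry has all measurements equal to zero while its $L_2(\Pi)$ norm still obeys (1.4).

```python
import random

def sample_positions(m1, m2, n, seed=0):
    """Draw n positions (j, k) uniformly at random with replacement (USR design)."""
    rng = random.Random(seed)
    return [(rng.randrange(m1), rng.randrange(m2)) for _ in range(n)]

def l2_pi_norm_sq(A):
    """Exact ||A||^2_{L2(Pi)}: <A, e_j e_k^T> = A[j][k], each drawn w.p. 1/(m1*m2)."""
    m1, m2 = len(A), len(A[0])
    return sum(a * a for row in A for a in row) / (m1 * m2)

def unobserved_rank_one(m1, m2, positions):
    """Return a rank-one basis matrix e_j e_k^T supported on an unobserved entry."""
    seen = set(positions)
    for j in range(m1):
        for k in range(m2):
            if (j, k) not in seen:
                E = [[0.0] * m2 for _ in range(m1)]
                E[j][k] = 1.0
                return E
    raise ValueError("every entry was observed; need n < m1 * m2")

# Illustrative sizes with n < m1 * m2, so an unobserved entry always exists.
m1, m2, n = 4, 5, 12
positions = sample_positions(m1, m2, n)
E = unobserved_rank_one(m1, m2, positions)
measurements = [E[j][k] for (j, k) in positions]  # <E, X_i> for each sampled X_i
```

Here `measurements` contains only zeros, so the sampling operator cannot distinguish `E` from the zero matrix, while `l2_pi_norm_sq(E)` equals $1/(m_1 m_2)$ as (1.4) requires.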

Example 2. Column masks. Assume that the design matrices $X_i$ are i.i.d. replications
of a random matrix $X$, which has only one nonzero column. For instance, let the
distribution of $X$ be such that all the columns have equal probability to be non-zero,
and the random entries of the non-zero column $x_{(j)}$ are such that $\mathbb{E}(x_{(j)} x_{(j)}^\top)$ is the identity
matrix. Then $\|A\|_{L_2(\Pi)}^2 = \|A\|_2^2 / m_2$, $\forall A \in \mathbb{R}^{m_1 \times m_2}$, so that condition (1.4) is satisfied
with $\mu = \sqrt{m_2}$. More generally, in view of application to multi-task learning, cf. [23],
one can be interested in considering non-identically distributed $X_i$. The model can be
then reformulated as a longitudinal regression model, with different distributions of $X_i$
corresponding to different tasks.

Example 3. "Complete" subgaussian design. Assume that the design matrices
$X_i$ are i.i.d. replications of a random matrix $X$ such that $\langle A, X \rangle$ is a subgaussian
random variable for any $A \in \mathbb{R}^{m_1 \times m_2}$. This approach has its roots in compressed
sensing. The two major examples are given by the matrices $X$ whose entries are either
i.i.d. standard Gaussian or Rademacher random variables. In both cases, we have
$\|A\|_{L_2(\Pi)}^2 = \|A\|_2^2$, $\forall A \in \mathbb{R}^{m_1 \times m_2}$, so that condition (1.4) is satisfied with $\mu = 1$. The
problem of exact reconstruction of $A_0$ under such a design in the non-noisy setting was
studied in [22], [UJ [THJ, whereas estimation of $A_0$ in the presence of noise is analyzed in
[18 1 123 } 19], among which [23], [9], [T7] treat the high-dimensional case $m_1 m_2 > n$.

Example 4. Fixed design. Assume that all the $\Pi_i$ are Dirac measures, so that
the design matrices $X_i$ are non-random. Then $\|A\|_{L_2(\Pi)}^2 = \frac{1}{n} \sum_{i=1}^n \langle A, X_i \rangle^2$, and we get
the problem of trace regression with fixed design, cf. [23]. In particular, if $m_1 = m_2$
and $A$ and $X_i$ are diagonal matrices, the trace regression model (1.2) becomes the usual
linear regression model. Accordingly, the rank of $A$ becomes the number of its non-zero
diagonal elements. This observation will allow us to deduce, as a consequence of our
general arguments, an oracle inequality for sparse linear regression with fixed design and
the Lasso improving [B] in the sense that the inequality is sharp (cf. Theorem [5] and
Section 5.3).

The general oracle inequalities that we will prove in Section 2 can be successfully
applied to the above examples. The emphasis in this paper will be on the matrix completion
problem (Example 1), for which the previously obtained results were suboptimal.

Statistical estimation of low rank matrices has recently become a very active field 
with a rapidly growing literature. The most popular methods are based on penalized 
empirical risk minimization with nuclear norm penalty [H [21 O [5j [U [9l [131 IS 123] - 
Estimators with other types of penalization, such as the Schatten-p norm [23], the von 
Neumann entropy [T7], penalization by the rank [12] or some combined penalties [13] 
are also discussed. 

It is worth pointing out that in many applications, such as in matrix completion,
the distribution $\Pi$ is known, and yet this information has not been exploited since the
penalized estimation procedures considered in the literature involve the empirical risk
$n^{-1} \sum_{i=1}^n (Y_i - \mathrm{tr}(X_i^\top A))^2$ (cf. [231 US]). In this paper we incorporate the knowledge of $\Pi$
in the construction and we study the following estimator of $A_0$:

$$\hat A^\lambda = \operatorname*{argmin}_{A \in \mathbb{A}} L_n(A), \tag{1.5}$$

where $\mathbb{A} \subseteq \mathbb{R}^{m_1 \times m_2}$ is a convex set of matrices,

$$L_n(A) = \|A\|_{L_2(\Pi)}^2 - \Big\langle \frac{2}{n} \sum_{i=1}^n Y_i X_i, A \Big\rangle + \lambda \|A\|_1, \tag{1.6}$$

$\lambda > 0$ is a regularization parameter, and $\|A\|_1$ is the nuclear norm of $A$. Note that if all
$X_i$ are non-random, $\hat A^\lambda$ coincides with the usual matrix Lasso estimator:

$$\hat A^\lambda = \operatorname*{argmin}_{A \in \mathbb{A}} \Big\{ \frac{1}{n} \sum_{i=1}^n \big( Y_i - \mathrm{tr}(X_i^\top A) \big)^2 + \lambda \|A\|_1 \Big\}. \tag{1.7}$$

The emphasis in this paper is on the noisy matrix completion setting. Then the estimator
$\hat A^\lambda$ has a particularly simple form; it is obtained from the matrix $\frac{m_1 m_2}{n} \sum_{i=1}^n Y_i X_i$
by soft thresholding of its singular values. One of the main results of this paper is to
show that our estimators are rate optimal (up to logarithmic factors) under the Frobenius
error for a simple class of matrices $\mathcal{A}(r, a)$ defined by two restrictions: the rank of
$A_0$ is not larger than a given $r$ and all the entries of $A_0$ are bounded in absolute value by
a constant $a$. This rather intuitive class was first considered in [15]. However, the
construction of the estimator in [15] requires the exact knowledge of $\mathrm{rank}(A_0)$ and the
upper bound on the Frobenius error obtained in [15] is suboptimal (see the details in
Section 3). The recent preprint [T3] obtains suboptimal bounds of "slow rate" type for
matrix completion. The analysis in [T7] is focused on complex-valued Hermitian matrices
with nuclear norm equal to 1 and motivated by the density matrix estimation problem in
quantum state tomography. These papers do not address the optimality issue. Optimal
rates in noisy matrix completion are derived in [23], but on different classes of matrices
and with the empirical prediction error rather than with the Frobenius error. Finally,
a very recent preprint [TI5] discusses the optimality issue for the Frobenius error on the
classes defined in terms of a "spikiness index" of $A_0$, which are not related to $\mathcal{A}(r, a)$,
and proposes estimators that require prior information about this index.

The main contributions of this paper are the following. In Section 2 we derive a
general oracle inequality for the prediction error $\|\hat A^\lambda - A_0\|_{L_2(\Pi)}^2$. This oracle inequality
is sharp, i.e., with leading constant 1, both in the case of "slow rate" (for matrices $A_0$
with small nuclear norm) and in the case of "fast rate" (for matrices $A_0$ with small
rank). As a particular instance of this general result, in Section 3 we obtain an oracle
inequality for the matrix completion problem. In Section 4 we establish minimax lower
bounds showing that the rates for matrix completion obtained in Section 3 are optimal
up to a logarithmic factor. In Section 5 we briefly discuss some other implications and
extensions of our method. Finally, Section 6 is devoted to the control of the stochastic
term appearing in the proof of the upper bound.

2 General oracle inequalities 

We recall first some basic facts about matrices. Let $A \in \mathbb{R}^{m_1 \times m_2}$ be a rectangular matrix,
and let $r = \mathrm{rank}(A) \le \min(m_1, m_2)$ denote its rank. The singular value decomposition
(SVD) of $A$ has the form $A = \sum_{j=1}^r \sigma_j(A) u_j v_j^\top$ with orthonormal vectors $u_1, \dots, u_r \in \mathbb{R}^{m_1}$,
orthonormal vectors $v_1, \dots, v_r \in \mathbb{R}^{m_2}$ and real numbers $\sigma_1(A) \ge \dots \ge \sigma_r(A) > 0$
(the singular values of $A$). The pair of linear vector spaces $(S_1, S_2)$, where $S_1$ is the linear
span of $\{u_1, \dots, u_r\}$ and $S_2$ is the linear span of $\{v_1, \dots, v_r\}$, will be called the support
of $A$. We will denote by $S_j^\perp$ the orthogonal complements of $S_j$, $j = 1, 2$, and by $P_S$ the
projector on the linear vector subspace $S$ of $\mathbb{R}^{m_j}$, $j = 1, 2$.

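The Schatten-$p$ (quasi-)norm $\|A\|_p$ of matrix $A$ is defined by

$$\|A\|_p = \Big( \sum_{j=1}^{\min(m_1, m_2)} \sigma_j(A)^p \Big)^{1/p} \ \text{for } 0 < p < \infty, \quad \text{and} \quad \|A\|_\infty = \sigma_1(A).$$

Recall the well-known trace duality property:

$$|\mathrm{tr}(A^\top B)| \le \|A\|_1 \|B\|_\infty, \quad \forall A, B \in \mathbb{R}^{m_1 \times m_2}.$$

We will also use the fact that the subdifferential of the convex function $A \mapsto \|A\|_1$ is the
following set of matrices:

$$\partial \|A\|_1 = \Big\{ \sum_{j=1}^r u_j v_j^\top + P_{S_1^\perp} W P_{S_2^\perp} : \|W\|_\infty \le 1 \Big\} \tag{2.1}$$

(cf. [...]). Define the random matrix

$$\mathbf{M} = \frac{1}{n} \sum_{i=1}^n \big( Y_i X_i - \mathbb{E}(Y_i X_i) \big). \tag{2.2}$$
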

We will need the following assumption on the distribution of the matrices $X_i$.

Assumption 1 There exists a constant $\mu > 0$ such that, for all matrices $A \in \mathbb{A} - \mathbb{A} :=
\{A_1 - A_2 : A_1, A_2 \in \mathbb{A}\}$,

$$\|A\|_{L_2(\Pi)}^2 \ge \mu^{-2} \|A\|_2^2.$$

As discussed in the Introduction, Assumption 1 is satisfied, often with equality and for
$\mathbb{A} = \mathbb{A} - \mathbb{A} = \mathbb{R}^{m_1 \times m_2}$, in several interesting examples. The next theorem plays the key
role in what follows.



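Theorem 1 Let $\mathbb{A} \subseteq \mathbb{R}^{m_1 \times m_2}$ be any set of matrices. If $\lambda \ge 2\|\mathbf{M}\|_\infty$, then

$$\|\hat A^\lambda - A_0\|_{L_2(\Pi)}^2 \le \inf_{A \in \mathbb{A}} \Big[ \|A - A_0\|_{L_2(\Pi)}^2 + 2\lambda \|A\|_1 \Big]. \tag{2.3}$$

If, in addition, $\mathbb{A}$ is a convex set and Assumption 1 is satisfied, then

$$\|\hat A^\lambda - A_0\|_{L_2(\Pi)}^2 \le \inf_{A \in \mathbb{A}} \Big[ \|A - A_0\|_{L_2(\Pi)}^2 + \Big( \frac{1 + \sqrt{2}}{2} \Big)^2 \mu^2 \lambda^2\, \mathrm{rank}(A) \Big]. \tag{2.4}$$

Furthermore, in this case for all $A \in \mathbb{A}$ with support $(S_1, S_2)$,

$$\|\hat A^\lambda - A_0\|_{L_2(\Pi)}^2 + (\lambda - 2\|\mathbf{M}\|_\infty) \|P_{S_1^\perp} \hat A^\lambda P_{S_2^\perp}\|_1
\le \|A - A_0\|_{L_2(\Pi)}^2 + \Big( \frac{1 + \sqrt{2}}{2} \Big)^2 \mu^2 \lambda^2\, \mathrm{rank}(A). \tag{2.5}$$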



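PROOF. It follows from the definition of the estimator $\hat A^\lambda$ that, for all $A \in \mathbb{A}$,

$$L_n(\hat A^\lambda) = \|\hat A^\lambda\|_{L_2(\Pi)}^2 - \Big\langle \frac{2}{n} \sum_{i=1}^n Y_i X_i, \hat A^\lambda \Big\rangle + \lambda \|\hat A^\lambda\|_1
\le \|A\|_{L_2(\Pi)}^2 - \Big\langle \frac{2}{n} \sum_{i=1}^n Y_i X_i, A \Big\rangle + \lambda \|A\|_1 = L_n(A).$$

Also, note that

$$\frac{1}{n} \sum_{i=1}^n \mathbb{E}(Y_i X_i) = \frac{1}{n} \sum_{i=1}^n \mathbb{E}\big( \langle A_0, X_i \rangle X_i \big) \quad \text{and} \quad \Big\langle \frac{1}{n} \sum_{i=1}^n \mathbb{E}(Y_i X_i), A \Big\rangle = \langle A_0, A \rangle_{L_2(\Pi)}.$$

Therefore, we have

$$\|\hat A^\lambda\|_{L_2(\Pi)}^2 - 2 \langle \hat A^\lambda, A_0 \rangle_{L_2(\Pi)} \le \|A\|_{L_2(\Pi)}^2 - 2 \langle A, A_0 \rangle_{L_2(\Pi)}
+ \Big\langle \frac{2}{n} \sum_{i=1}^n \big( Y_i X_i - \mathbb{E}(Y_i X_i) \big), \hat A^\lambda - A \Big\rangle + \lambda \big( \|A\|_1 - \|\hat A^\lambda\|_1 \big),$$

which implies, due to the trace duality,

$$\|\hat A^\lambda - A_0\|_{L_2(\Pi)}^2 \le \|A - A_0\|_{L_2(\Pi)}^2 + 2\Delta \|\hat A^\lambda - A\|_1 + \lambda \big( \|A\|_1 - \|\hat A^\lambda\|_1 \big),$$

where we set for brevity $\Delta = \|\mathbf{M}\|_\infty$. Under the assumption $\lambda \ge 2\Delta$ this yields

$$\|\hat A^\lambda - A_0\|_{L_2(\Pi)}^2 \le \|A - A_0\|_{L_2(\Pi)}^2 + \lambda \big( \|\hat A^\lambda - A\|_1 + \|A\|_1 - \|\hat A^\lambda\|_1 \big) \le \|A - A_0\|_{L_2(\Pi)}^2 + 2\lambda \|A\|_1,$$

and the bound (2.3) follows.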

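To prove the remaining bounds, note that a necessary condition of extremum in
problem (1.5) implies that there exists $\hat V \in \partial \|\hat A^\lambda\|_1$ such that, for all $A \in \mathbb{A}$,

$$2 \langle \hat A^\lambda, \hat A^\lambda - A \rangle_{L_2(\Pi)} - \Big\langle \frac{2}{n} \sum_{i=1}^n Y_i X_i, \hat A^\lambda - A \Big\rangle + \lambda \langle \hat V, \hat A^\lambda - A \rangle \le 0. \tag{2.6}$$

Indeed, since $\hat A^\lambda$ is a minimizer of $L_n(A)$ in $\mathbb{A}$, there exists a matrix $B \in \partial L_n(\hat A^\lambda)$ such
that $-B$ belongs to the normal cone of $\mathbb{A}$ at the point $\hat A^\lambda$. It is easy to see that $B$ can
be represented as follows

$$B = 2 \int \langle \hat A^\lambda, X \rangle X \, \Pi(dX) - \frac{2}{n} \sum_{i=1}^n Y_i X_i + \lambda \hat V,$$

where $\hat V \in \partial \|\hat A^\lambda\|_1$. The condition that $-B$ belongs to the normal cone at the point $\hat A^\lambda$
implies that $\langle B, \hat A^\lambda - A \rangle \le 0$, and (2.6) follows.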

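Consider an arbitrary $A \in \mathbb{A}$ of rank $r$ with spectral representation $A = \sum_{j=1}^r \sigma_j u_j v_j^\top$
and with support $(S_1, S_2)$. It follows from (2.6) that

$$2 \langle \hat A^\lambda - A_0, \hat A^\lambda - A \rangle_{L_2(\Pi)} + \lambda \langle \hat V - V, \hat A^\lambda - A \rangle \le -\lambda \langle V, \hat A^\lambda - A \rangle + 2 \langle \mathbf{M}, \hat A^\lambda - A \rangle \tag{2.7}$$

for an arbitrary $V \in \partial \|A\|_1$. By monotonicity of subdifferentials of convex functions,
$\langle \hat V - V, \hat A^\lambda - A \rangle \ge 0$. On the other hand, by (2.1), the following representation holds

$$V = \sum_{j=1}^r u_j v_j^\top + P_{S_1^\perp} W P_{S_2^\perp},$$

where $W$ is an arbitrary matrix with $\|W\|_\infty \le 1$. It follows from the trace duality that
there exists $W$ with $\|W\|_\infty \le 1$ such that

$$\langle P_{S_1^\perp} W P_{S_2^\perp}, \hat A^\lambda - A \rangle = \langle P_{S_1^\perp} W P_{S_2^\perp}, \hat A^\lambda \rangle = \langle W, P_{S_1^\perp} \hat A^\lambda P_{S_2^\perp} \rangle = \|P_{S_1^\perp} \hat A^\lambda P_{S_2^\perp}\|_1,$$

where in the first equality we used that $A$ has the support $(S_1, S_2)$. For this particular
choice of $W$, (2.7) implies that

$$2 \langle \hat A^\lambda - A_0, \hat A^\lambda - A \rangle_{L_2(\Pi)} + \lambda \|P_{S_1^\perp} \hat A^\lambda P_{S_2^\perp}\|_1 \le -\lambda \Big\langle \sum_{j=1}^r u_j v_j^\top, \hat A^\lambda - A \Big\rangle + 2 \langle \mathbf{M}, \hat A^\lambda - A \rangle. \tag{2.8}$$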

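Using the identity

$$2 \langle \hat A^\lambda - A_0, \hat A^\lambda - A \rangle_{L_2(\Pi)} = \|\hat A^\lambda - A_0\|_{L_2(\Pi)}^2 + \|\hat A^\lambda - A\|_{L_2(\Pi)}^2 - \|A - A_0\|_{L_2(\Pi)}^2 \tag{2.9}$$

and the facts that

$$\Big\| \sum_{j=1}^r u_j v_j^\top \Big\|_\infty = 1, \qquad \Big\langle \sum_{j=1}^r u_j v_j^\top, \hat A^\lambda - A \Big\rangle = \Big\langle \sum_{j=1}^r u_j v_j^\top, P_{S_1} (\hat A^\lambda - A) P_{S_2} \Big\rangle, \tag{2.10}$$

we deduce from (2.8) that

$$\|\hat A^\lambda - A_0\|_{L_2(\Pi)}^2 + \|\hat A^\lambda - A\|_{L_2(\Pi)}^2 + \lambda \|P_{S_1^\perp} \hat A^\lambda P_{S_2^\perp}\|_1
\le \|A - A_0\|_{L_2(\Pi)}^2 + \lambda \|P_{S_1} (\hat A^\lambda - A) P_{S_2}\|_1 + 2 \langle \mathbf{M}, \hat A^\lambda - A \rangle. \tag{2.11}$$

To provide an upper bound on $2 \langle \mathbf{M}, \hat A^\lambda - A \rangle$ we use the following decomposition

$$\langle \mathbf{M}, \hat A^\lambda - A \rangle = \langle \mathcal{P}_A(\mathbf{M}), \hat A^\lambda - A \rangle + \langle P_{S_1^\perp} \mathbf{M} P_{S_2^\perp}, \hat A^\lambda - A \rangle
= \langle \mathcal{P}_A(\mathbf{M}), \mathcal{P}_A(\hat A^\lambda - A) \rangle + \langle P_{S_1^\perp} \mathbf{M} P_{S_2^\perp}, \hat A^\lambda \rangle, \tag{2.12}$$

where $\mathcal{P}_A(\mathbf{M}) = \mathbf{M} - P_{S_1^\perp} \mathbf{M} P_{S_2^\perp}$. This implies, due to the trace duality,

$$2 |\langle \mathbf{M}, \hat A^\lambda - A \rangle| \le \Lambda \|\mathcal{P}_A(\hat A^\lambda - A)\|_2 + \Gamma \|P_{S_1^\perp} \hat A^\lambda P_{S_2^\perp}\|_1
\le \Lambda \|\hat A^\lambda - A\|_2 + \Gamma \|P_{S_1^\perp} \hat A^\lambda P_{S_2^\perp}\|_1, \tag{2.13}$$

where

$$\Lambda = 2 \|\mathcal{P}_A(\mathbf{M})\|_2, \qquad \Gamma = 2 \|P_{S_1^\perp} \mathbf{M} P_{S_2^\perp}\|_\infty.$$

Note that

$$\Gamma \le 2 \|\mathbf{M}\|_\infty = 2\Delta. \tag{2.14}$$

Since

$$\mathcal{P}_A(\mathbf{M}) = P_{S_1^\perp} \mathbf{M} P_{S_2} + P_{S_1} \mathbf{M}$$

and $\mathrm{rank}(P_{S_j}) \le \mathrm{rank}(A)$, $j = 1, 2$, we have

$$\Lambda \le 2 \sqrt{\mathrm{rank}(\mathcal{P}_A(\mathbf{M}))}\, \|\mathbf{M}\|_\infty \le 2 \sqrt{2\, \mathrm{rank}(A)}\, \Delta \le \sqrt{2\, \mathrm{rank}(A)}\, \lambda. \tag{2.15}$$

Due to the fact that

$$\|P_{S_1} (\hat A^\lambda - A) P_{S_2}\|_1 \le \sqrt{\mathrm{rank}(A)}\, \|P_{S_1} (\hat A^\lambda - A) P_{S_2}\|_2 \le \sqrt{\mathrm{rank}(A)}\, \|\hat A^\lambda - A\|_2 \tag{2.16}$$

and to Assumption 1, it follows from (2.11) and (2.13) that

$$\|\hat A^\lambda - A_0\|_{L_2(\Pi)}^2 + \|\hat A^\lambda - A\|_{L_2(\Pi)}^2 + \lambda \|P_{S_1^\perp} \hat A^\lambda P_{S_2^\perp}\|_1
\le \|A - A_0\|_{L_2(\Pi)}^2 + \mu \big( \lambda \sqrt{\mathrm{rank}(A)} + \Lambda \big) \|\hat A^\lambda - A\|_{L_2(\Pi)} + \Gamma \|P_{S_1^\perp} \hat A^\lambda P_{S_2^\perp}\|_1. \tag{2.17}$$

Using the above bounds on $\Lambda$ and $\Gamma$, we obtain from (2.17) that

$$\|\hat A^\lambda - A_0\|_{L_2(\Pi)}^2 + \|\hat A^\lambda - A\|_{L_2(\Pi)}^2 + (\lambda - 2\Delta) \|P_{S_1^\perp} \hat A^\lambda P_{S_2^\perp}\|_1
\le \|A - A_0\|_{L_2(\Pi)}^2 + (1 + \sqrt{2}) \mu \lambda \sqrt{\mathrm{rank}(A)}\, \|\hat A^\lambda - A\|_{L_2(\Pi)},$$

which implies, since $bx \le x^2 + b^2/4$ for all real $b, x$,

$$\|\hat A^\lambda - A_0\|_{L_2(\Pi)}^2 + (\lambda - 2\Delta) \|P_{S_1^\perp} \hat A^\lambda P_{S_2^\perp}\|_1
\le \|A - A_0\|_{L_2(\Pi)}^2 + \frac{1}{4} (1 + \sqrt{2})^2 \mu^2 \lambda^2\, \mathrm{rank}(A).$$

□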

The following immediate corollary of Theorem 1 provides a bound for the Frobenius
error.

Corollary 1 Let $\mathbb{A}$ be a convex subset of $m_1 \times m_2$ matrices containing $A_0$, and let
Assumption 1 be satisfied. If $\lambda \ge 2\|\mathbf{M}\|_\infty$, then

$$\|\hat A^\lambda - A_0\|_2^2 \le \lambda \mu^2 \min \Big\{ 2 \|A_0\|_1,\ \Big( \frac{1 + \sqrt{2}}{2} \Big)^2 \lambda \mu^2\, \mathrm{rank}(A_0) \Big\}. \tag{2.18}$$



Next we consider a version of Theorem 1 under weaker assumptions than Assumption 1
that are akin to the Restricted Eigenvalue condition in sparse estimation of vectors.
For simplicity, we will do it only when the domain $\mathbb{A}$ of minimization in (1.5) is a linear
subspace of $\mathbb{R}^{m_1 \times m_2}$. Recall that, given $A \in \mathbb{A}$ with support $(S_1, S_2)$, we denote

$$\mathcal{P}_A(B) := B - P_{S_1^\perp} B P_{S_2^\perp}, \qquad \mathcal{P}_A^\perp(B) := P_{S_1^\perp} B P_{S_2^\perp}, \qquad B \in \mathbb{R}^{m_1 \times m_2},$$

and, for $c_0 \ge 0$, define the following cone of matrices:

$$\mathcal{C}_{A, c_0} := \big\{ B \in \mathbb{A} : \|\mathcal{P}_A^\perp(B)\|_1 \le c_0 \|\mathcal{P}_A(B)\|_1 \big\}.$$

Finally, define

$$\mu_{c_0}(A) := \inf \big\{ \mu' > 0 : \|\mathcal{P}_A(B)\|_2 \le \mu' \|B\|_{L_2(\Pi)},\ \forall B \in \mathcal{C}_{A, c_0} \big\}.$$

Note that $\mu_{c_0}(A)$ is a nondecreasing function of $c_0$. For $c_0 = +\infty$, the quantity $\mu_\infty(A)$
has a simple meaning: it is equal to the norm of the linear transformation $B \mapsto \mathcal{P}_A(B)$
from the space $\mathbb{A}$ equipped with the $L_2(\Pi)$-norm into the space of all matrices equipped
with the Frobenius norm. For $c_0 = 0$, $\mu_0(A)$ is the norm of the same linear transformation
restricted to the subspace of $\mathbb{A}$ consisting of all matrices $B \in \mathbb{A}$ with $\mathcal{P}_A^\perp(B) = 0$. We
are more interested in the intermediate values, $c_0 \in (0, +\infty)$. In this case, $\mu_{c_0}(A)$ is
the "norm" of the linear mapping $\mathcal{P}_A$ restricted to the cone of matrices $B$ for which
$\mathcal{P}_A(B)$ is the "dominant" part and $\mathcal{P}_A^\perp(B)$ is "small". Note that the rank of $\mathcal{P}_A(B)$ is
not larger than $2\,\mathrm{rank}(A)$, so, when the rank of $A$ is small, the matrices in $\mathcal{C}_{A, c_0}$ are
approximately "low rank". Quantities of the same flavor have been previously used
in the literature on Lasso, Dantzig selector and other methods of sparse estimation of
vectors. In these problems, they can be expressed in terms of "restricted eigenvalues" of
certain Gram matrices, cf. the Restricted Eigenvalue condition in [UJ for the fixed design
case and similar distribution dependent conditions in [16] for the random design case.
Such conditions are also considered in [2D] for the matrix case. In what follows, we use
the value $c_0 = 5$ and set $\mu(A) := \mu_5(A)$.



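Theorem 2 Let $\mathbb{A}$ be a linear subspace of $\mathbb{R}^{m_1 \times m_2}$. If $\lambda \ge 3\|\mathbf{M}\|_\infty$, then

$$\|\hat A^\lambda - A_0\|_{L_2(\Pi)}^2 \le \inf_{A \in \mathbb{A}} \Big[ \|A - A_0\|_{L_2(\Pi)}^2 + \lambda^2 \mu^2(A)\, \mathrm{rank}(A) \Big]. \tag{2.19}$$

PROOF. Fix $A \in \mathbb{A}$ with support $(S_1, S_2)$. If $\langle \hat A^\lambda - A_0, \hat A^\lambda - A \rangle_{L_2(\Pi)} \le 0$, then we
trivially have $\|\hat A^\lambda - A_0\|_{L_2(\Pi)}^2 \le \|A - A_0\|_{L_2(\Pi)}^2$ in view of (2.9). Thus, assume that
$\langle \hat A^\lambda - A_0, \hat A^\lambda - A \rangle_{L_2(\Pi)} > 0$. In this case, (2.8) and an obvious modification of (2.10)
imply

$$\lambda \|P_{S_1^\perp} \hat A^\lambda P_{S_2^\perp}\|_1 \le \lambda \|\mathcal{P}_A(\hat A^\lambda - A)\|_1 + 2 \langle \mathbf{M}, \hat A^\lambda - A \rangle. \tag{2.20}$$

Now,

$$\langle \mathbf{M}, \hat A^\lambda - A \rangle = \langle \mathbf{M}, \mathcal{P}_A(\hat A^\lambda - A) \rangle + \langle \mathbf{M}, \mathcal{P}_A^\perp(\hat A^\lambda - A) \rangle
\le \|\mathbf{M}\|_\infty \big( \|\mathcal{P}_A(\hat A^\lambda - A)\|_1 + \|\mathcal{P}_A^\perp(\hat A^\lambda - A)\|_1 \big). \tag{2.21}$$

By (2.20) and (2.21), and since $\mathcal{P}_A^\perp(\hat A^\lambda - A) = P_{S_1^\perp} \hat A^\lambda P_{S_2^\perp}$,

$$(\lambda - 2\Delta) \|\mathcal{P}_A^\perp(\hat A^\lambda - A)\|_1 \le (\lambda + 2\Delta) \|\mathcal{P}_A(\hat A^\lambda - A)\|_1. \tag{2.22}$$

For $\lambda \ge 3\Delta$, this yields

$$\|\mathcal{P}_A^\perp(\hat A^\lambda - A)\|_1 \le 5 \|\mathcal{P}_A(\hat A^\lambda - A)\|_1,$$

which implies that $\hat A^\lambda - A \in \mathcal{C}_{A, 5}$, and thus $\|\mathcal{P}_A(\hat A^\lambda - A)\|_2 \le \mu(A) \|\hat A^\lambda - A\|_{L_2(\Pi)}$.
Combining this inequality with (2.11), (2.12), (2.13), (2.15) and using that $\lambda \ge 3\Delta$,
after some algebra we get

$$\|\hat A^\lambda - A_0\|_{L_2(\Pi)}^2 + \|\hat A^\lambda - A\|_{L_2(\Pi)}^2 + (\lambda/3) \|P_{S_1^\perp} \hat A^\lambda P_{S_2^\perp}\|_1$$
$$\le \|A - A_0\|_{L_2(\Pi)}^2 + (1 + 2\sqrt{2}/3)\, \mu(A) \lambda \sqrt{\mathrm{rank}(A)}\, \|\hat A^\lambda - A\|_{L_2(\Pi)}$$
$$\le \|A - A_0\|_{L_2(\Pi)}^2 + \|\hat A^\lambda - A\|_{L_2(\Pi)}^2 + \mu^2(A) \lambda^2\, \mathrm{rank}(A).$$

□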

As a simple example, consider the case when $m_1 = m_2$, $\mathbb{A}$ is the space of all diagonal
matrices, and $X_i$ also belong to $\mathbb{A}$. Then the trace regression model (1.2) becomes the
usual linear regression model. The Schatten $p$-norms are in this case equivalent to the
$\ell_p$-norms, with the operator norm $\|\cdot\|_\infty$ being the $\ell_\infty$-norm and the rank of matrix $A$
characterizing the sparsity of the corresponding vector. The problem of minimizing the
functional $L_n(A)$ over the space $\mathbb{A}$ is a Lasso-type penalized empirical risk minimization.
In particular, it coincides with the standard Lasso if all $X_i$ are non-random. Inequalities
of Theorem 1 and (2.19) become, in this case, sparsity oracle inequalities for the Lasso-type
estimators. It is noteworthy that these inequalities are sharp (i.e., with leading
constant 1), which was not achieved in the past work. The random matrix $\mathbf{M}$ is also
diagonal and its norm $\|\mathbf{M}\|_\infty$ is just the $\ell_\infty$-norm of the corresponding random vector,
which is the sum of independent random vectors. Hence, it is easy to provide probabilistic
bounds on $\|\mathbf{M}\|_\infty$ using, for instance, the classical Bernstein inequality and the union
bound. We give an example of such an application in Section 5.3.
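
The reduction to linear regression described above can be written out explicitly. The display below is a worked restatement under illustrative notation that is not in the original: $p = m_1 = m_2$, $\beta, \beta_0 \in \mathbb{R}^p$ denote the diagonals of $A$ and $A_0$, and $x_i \in \mathbb{R}^p$ the diagonal of $X_i$.

```latex
% Diagonal case: A = diag(beta), A_0 = diag(beta_0), X_i = diag(x_i), with p = m_1 = m_2.
\begin{align*}
  \operatorname{tr}(X_i^\top A) &= \sum_{k=1}^{p} x_{ik}\,\beta_k = x_i^\top \beta,
  & Y_i &= x_i^\top \beta_0 + \xi_i, \\
  \|A\|_1 &= \sum_{k=1}^{p} |\beta_k| = \|\beta\|_{\ell_1},
  & \|A\|_\infty &= \max_{1 \le k \le p} |\beta_k| = \|\beta\|_{\ell_\infty}, \\
  \operatorname{rank}(A) &= \#\{k : \beta_k \neq 0\}. & &
\end{align*}
% For non-random X_i the estimator (1.7) is then the usual Lasso:
\[
  \hat\beta^{\lambda} = \operatorname*{argmin}_{\beta \in \mathbb{R}^{p}}
  \Big\{ \frac{1}{n} \sum_{i=1}^{n} \big( Y_i - x_i^\top \beta \big)^2 + \lambda \|\beta\|_{\ell_1} \Big\}.
\]
```

The singular values of a diagonal matrix are the absolute values of its diagonal entries, which is why the nuclear norm, operator norm and rank turn into the $\ell_1$-norm, $\ell_\infty$-norm and number of non-zero coordinates.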



3 Upper bounds for matrix completion 

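In this section we consider implications of the general oracle inequalities of Theorem 1
for the model of USR matrix completion. Thus, we assume that the matrices $X_i$
are i.i.d. uniformly distributed in the matrix completion basis $\mathcal{X}$, which implies that
$\|A\|_{L_2(\Pi)}^2 = (m_1 m_2)^{-1} \|A\|_2^2$ for all matrices $A \in \mathbb{R}^{m_1 \times m_2}$, and we set $\mu = \sqrt{m_1 m_2}$. The
estimator $\hat A^\lambda$ is then defined by (here and further on we set $\mathbb{A} = \mathbb{R}^{m_1 \times m_2}$ in the case of
matrix completion):

$$\hat A^\lambda = \operatorname*{argmin}_{A \in \mathbb{R}^{m_1 \times m_2}} \Big( \frac{1}{m_1 m_2} \|A\|_2^2 - \Big\langle \frac{2}{n} \sum_{i=1}^n Y_i X_i, A \Big\rangle + \lambda \|A\|_1 \Big)
= \operatorname*{argmin}_{A \in \mathbb{R}^{m_1 \times m_2}} \Big( \|A - \mathbf{X}\|_2^2 + \lambda m_1 m_2 \|A\|_1 \Big), \tag{3.1}$$

where

$$\mathbf{X} = \frac{m_1 m_2}{n} \sum_{i=1}^n Y_i X_i.$$

We can also write $\hat A^\lambda$ explicitly:

$$\hat A^\lambda = \sum_j \big( \sigma_j(\mathbf{X}) - \lambda m_1 m_2 / 2 \big)_+ u_j(\mathbf{X}) v_j(\mathbf{X})^\top, \tag{3.2}$$

where $x_+ = \max\{x, 0\}$, $\sigma_j(\mathbf{X})$ are the singular values and $u_j(\mathbf{X})$, $v_j(\mathbf{X})$ are the left and
right singular vectors of $\mathbf{X} = \sum_{j=1}^{\mathrm{rank}(\mathbf{X})} \sigma_j(\mathbf{X}) u_j(\mathbf{X}) v_j(\mathbf{X})^\top$. Thus, the computation of
$\hat A^\lambda$ is simple; it reduces to soft thresholding of singular values in the SVD of $\mathbf{X}$. To see
why (3.2) gives the solution of (3.1), note that, in view of (2.1), the subdifferential of
$F(A) = \|A - \mathbf{X}\|_2^2 + \lambda m_1 m_2 \|A\|_1$ is the set of matrices

$$\partial F(A) = \Big\{ 2 (A - \mathbf{X}) + \lambda m_1 m_2 \Big( \sum_{j=1}^r u_j v_j^\top + P_{S_1^\perp} W P_{S_2^\perp} \Big) : \|W\|_\infty \le 1 \Big\},$$

where $r, u_j, v_j, S_1, S_2$ correspond to the SVD of $A$. Since $A \mapsto F(A)$ is strictly convex,
the minimizer is unique, and the condition $0 \in \partial F(\hat A^\lambda)$ is a necessary and sufficient
characterization of the minimum, where $0$ is the zero $m_1 \times m_2$ matrix. Considering

$$W = \sum_{j:\ \sigma_j(\mathbf{X}) \le \lambda m_1 m_2 / 2} \frac{2 \sigma_j(\mathbf{X})}{\lambda m_1 m_2}\, u_j(\mathbf{X}) v_j(\mathbf{X})^\top,$$

it is easy to check that (3.2) satisfies this condition.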
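
The thresholding step in (3.2) acts only on the singular values, so it can be illustrated without computing an SVD. The sketch below assumes the singular values of $\mathbf{X}$ are already available as a list; the function names and the numbers are illustrative, not from the paper. It relies on the observation that, among matrices sharing the singular vectors of $\mathbf{X}$, the objective $F$ splits into scalar problems $(x - \sigma_j)^2 + \lambda m_1 m_2 x$ over $x \ge 0$.

```python
def soft_threshold(sigmas, lam, m1, m2):
    """Singular values of the estimator in (3.2): (sigma_j - lam*m1*m2/2)_+."""
    tau = lam * m1 * m2 / 2.0
    return [max(s - tau, 0.0) for s in sigmas]

def scalar_objective(x, sigma, lam, m1, m2):
    """One decoupled term of F: (x - sigma)^2 + lam*m1*m2*x, for x >= 0."""
    return (x - sigma) ** 2 + lam * m1 * m2 * x

# Illustrative inputs: the threshold is lam*m1*m2/2 = 0.5*2*2/2 = 1.0.
sigmas = [5.0, 2.0, 0.5]
lam, m1, m2 = 0.5, 2, 2
shrunk = soft_threshold(sigmas, lam, m1, m2)

# Brute-force check on a grid that each shrunk value minimizes its scalar term.
grid = [i / 100 for i in range(0, 701)]
grid_minimizers = [
    min(grid, key=lambda x, s=s: scalar_objective(x, s, lam, m1, m2))
    for s in sigmas
]
```

Singular values below the threshold are set to zero, which is how the estimator ends up with low rank.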

In view of Theorem 1, to get the oracle inequalities in a closed form it remains
only to specify the value of the regularization parameter $\lambda$ such that $\lambda \ge 2\|\mathbf{M}\|_\infty$ with high
probability. This requires some assumptions on the distribution of $(X_i, Y_i)$, and the value
of $\lambda$ will be different under different assumptions. We will consider only the following
two cases of particular interest.

• Sub-exponential noise and matrices with uniformly bounded entries. There exist
constants $\sigma, c_1 > 0$, $\alpha \ge 1$ and $\tilde c$ such that

$$\max_{i = 1, \dots, n} \mathbb{E} \exp\Big( \frac{|\xi_i|^\alpha}{\sigma^\alpha} \Big) < \tilde c, \qquad \mathbb{E} \xi_i^2 \ge c_1 \sigma^2, \quad \forall 1 \le i \le n, \tag{3.3}$$

and $\max_{i,j} |a_0(i, j)| \le a$ for some constant $a$.

• Statistical learning setting. There exists a constant $\eta$ such that $\max_{i = 1, \dots, n} |Y_i| \le \eta$
almost surely.

In both cases, we obtain the upper bounds for $\|\mathbf{M}\|_\infty$ (that we call the stochastic error)
using the non-commutative Bernstein inequalities, cf. Section 6. The resulting values of
$\lambda$ and the corresponding oracle inequalities are given in the next two theorems.

Set $m = m_1 + m_2$. In what follows, we will denote by $C$ absolute positive constants,
possibly different on different occasions.
Theorem 3 Let $X_i$ be i.i.d. uniformly distributed on $\mathcal{X}$, and the pairs $(X_i, Y_i)$ be i.i.d.
Assume that $\max_{i,j} |a_0(i, j)| \le a$ for some constant $a$, and that condition (3.3) holds.
For $t > 0$, consider the regularization parameter $\lambda$ satisfying

$$\lambda \ge C (\sigma \vee a) \max \Big\{ \sqrt{\frac{t + \log(m)}{(m_1 \wedge m_2) n}},\ \frac{(t + \log(m)) \log^{1/\alpha}(m_1 \wedge m_2)}{n} \Big\}, \tag{3.4}$$

where $C > 0$ is a large enough constant that can depend only on $\alpha, c_1, \tilde c$. Then with
probability at least $1 - 3e^{-t}$ we have

$$\|\hat A^\lambda - A_0\|_2^2 \le \|A - A_0\|_2^2 + m_1 m_2 \min \Big\{ 2\lambda \|A\|_1,\ \Big( \frac{1 + \sqrt{2}}{2} \Big)^2 m_1 m_2 \lambda^2\, \mathrm{rank}(A) \Big\} \tag{3.5}$$

for all $A \in \mathbb{R}^{m_1 \times m_2}$.

Theorem 4 Let $X_i$ be i.i.d. uniformly distributed on $\mathcal{X}$. Assume that $\max_{i = 1, \dots, n} |Y_i| \le \eta$
almost surely for some constant $\eta$. For $t > 0$ consider the regularization parameter $\lambda$
satisfying

$$\lambda \ge 4\eta \max \Big\{ \sqrt{\frac{t + \log(m)}{(m_1 \wedge m_2) n}},\ \frac{2 (t + \log(m))}{n} \Big\}. \tag{3.6}$$

Then with probability at least $1 - e^{-t}$ inequality (3.5) holds for all $A \in \mathbb{R}^{m_1 \times m_2}$.

Theorems 3 and 4 follow immediately from Theorem 1 and Lemmas 1, 2 and 3 with
$\mu = \sqrt{m_1 m_2}$.

Note that the natural choice of $t$ in Theorems 3 and 4 is of the order $\log(m)$, since a
larger $t$ leads to a slower rate of convergence and a smaller $t$ does not improve the rate but
makes the concentration probability smaller. Note also that, under this choice of $t$, the
second terms under the maxima in (3.4) and (3.6) are negligible for the values of $n, m_1, m_2$
such that the term containing $\mathrm{rank}(A_0)$ in (3.5) is meaningful. Indeed, if $t$ is of the order
$\log(m)$, the condition that $m_1 m_2 \lambda^2 \ll 1$ necessarily implies $n \gg (m_1 \vee m_2) \log(m)$. On
the other hand, the negligibility of the second terms under the maxima in (3.4) and (3.6)
is approximately equivalent to $n > (m_1 \wedge m_2) \log^{1 + 2/\alpha}(m)$ and $n > (m_1 \wedge m_2) \log(m)$,
respectively. Based on these remarks, we can choose $\lambda$ in the form

$$\lambda = C_* c_* \sqrt{\frac{\log(m)}{(m_1 \wedge m_2) n}}, \tag{3.7}$$

where $c_*$ equals either $\sigma \vee a$ or $\eta$ and the constant $C_* > 0$ is large enough, and we
can state the following corollary that will be further useful for minimax considerations.
Define $\tau > 0$ by

$$\tau^2 = \Big( \frac{1 + \sqrt{2}}{2} \Big)^2 C_*^2 c_*^2\, \frac{M \log(m)}{n},$$

where $M = \max(m_1, m_2)$, and $m = m_1 + m_2$.

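Corollary 2 Let one of the sets of conditions (i) or (ii) below be satisfied:

(i) The assumptions of Theorem 3 with $\lambda$ as in (3.7), $n > (m_1 \wedge m_2) \log^{1 + 2/\alpha}(m)$,
$c_* = \sigma \vee a$, and a large enough constant $C_* > 0$ that can depend only on $\alpha, c_1, \tilde c$.

(ii) The assumptions of Theorem 4 with $n > 4 (m_1 \wedge m_2) \log(m)$, $\lambda$ as in (3.7),
$c_* = \eta$, and $C_* = 4$.

Then, with probability at least $1 - 3/(m_1 + m_2)$,

$$\frac{1}{m_1 m_2} \|\hat A^\lambda - A_0\|_2^2 \le \min_{A \in \mathbb{R}^{m_1 \times m_2}} \Big( \frac{1}{m_1 m_2} \|A - A_0\|_2^2 + \tau^2\, \mathrm{rank}(A) \Big), \tag{3.8}$$

and, in particular,

$$\frac{1}{m_1 m_2} \|\hat A^\lambda - A_0\|_2^2 \le \Big( \frac{1 + \sqrt{2}}{2} \Big)^2 C_*^2 c_*^2 \log(m)\, \frac{M\, \mathrm{rank}(A_0)}{n}, \tag{3.9}$$

where $M = \max(m_1, m_2)$, and $m = m_1 + m_2$. Furthermore, with the same probability,

$$\frac{1}{m_1 m_2} \|\hat A^\lambda - A_0\|_2^2 \le \sum_j \min \Big\{ \tau^2, \frac{\sigma_j^2(A_0)}{m_1 m_2} \Big\} \le \inf_{0 < q \le 2} \frac{\tau^{2 - q} \|A_0\|_q^q}{(m_1 m_2)^{q/2}}. \tag{3.10}$$

PROOF. Inequalities (3.8) and (3.9) are straightforward in view of Theorems 3 and 4.
To prove (3.10) it suffices to note that, for any $\kappa > 0$, $0 < q \le 2$,

$$\min_A \big( \|A - A_0\|_2^2 + \kappa^2\, \mathrm{rank}(A) \big) = \sum_j \min \{ \sigma_j^2(A_0), \kappa^2 \} = \kappa^2 \sum_j \min \Big\{ 1, \Big( \frac{\sigma_j(A_0)}{\kappa} \Big)^2 \Big\}$$
$$\le \kappa^2 \sum_j \min \Big\{ 1, \Big( \frac{\sigma_j(A_0)}{\kappa} \Big)^q \Big\} \le \kappa^{2 - q} \|A_0\|_q^q,$$

and to apply this with $\kappa^2 = m_1 m_2 \tau^2$.

□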
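
The identity used in the proof of (3.10) can be checked numerically from the singular values alone. The sketch below assumes the standard fact that the best rank-$k$ approximation error in Frobenius norm is the sum of the squared trailing singular values; the function names and the numbers are illustrative, not from the paper.

```python
def best_tradeoff(sigmas, kappa):
    """min over k of (best rank-k squared approximation error) + kappa^2 * k."""
    s = sorted(sigmas, reverse=True)
    return min(
        sum(x * x for x in s[k:]) + kappa ** 2 * k
        for k in range(len(s) + 1)
    )

def closed_form(sigmas, kappa):
    """Sum over j of min(sigma_j^2, kappa^2)."""
    return sum(min(x * x, kappa ** 2) for x in sigmas)

def schatten_bound(sigmas, kappa, q):
    """kappa^(2-q) * ||A_0||_q^q, the last bound in the proof, for 0 < q <= 2."""
    return kappa ** (2 - q) * sum(x ** q for x in sigmas)

# Illustrative singular values of A_0 and penalty level kappa.
sigmas = [3.0, 1.5, 0.5, 0.1]
kappa = 1.0
lhs = best_tradeoff(sigmas, kappa)
rhs = closed_form(sigmas, kappa)
bounds = {q: schatten_bound(sigmas, kappa, q) for q in (0.5, 1.0, 2.0)}
```

In this example the minimum is attained by keeping the two singular values above `kappa`, and every entry of `bounds` dominates `rhs`, as the chain of inequalities predicts.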

Inequality (3.9) guarantees that the normalized Frobenius error $(m_1 m_2)^{-1} \|\hat A^\lambda - A_0\|_2^2$
of the estimator $\hat A^\lambda$ is small whenever $n > C (m_1 \vee m_2) \log(m)\, \mathrm{rank}(A_0)$ with a
large enough $C > 0$. This quantifies the sample size $n$ necessary for successful matrix
completion from noisy data.

Keshavan et al. [15], Theorem 1.1, under a sampling scheme different from ours
(sampling without replacement) and sub-gaussian errors, proposed an estimator $\hat A$
satisfying, with probability at least $1 - (m_1 \wedge m_2)^{-3}$,

$$\frac{1}{m_1 m_2} \|\hat A - A_0\|_2^2 \le C \sqrt{\beta}\, \log(n)\, \frac{M\, \mathrm{rank}(A_0)}{n}, \tag{3.11}$$

where $C > 0$ is a constant, and $\beta = (m_1 \vee m_2)/(m_1 \wedge m_2)$ is the aspect ratio. A drawback is
that the construction of $\hat A$ in [15] requires the exact knowledge of $\mathrm{rank}(A_0)$. Furthermore,
the bound (3.11) is suboptimal for "very rectangular" matrices, i.e., when $\beta \gg 1$.
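
To see how (3.7) and (3.9) depend on the problem sizes, the two formulas can be evaluated directly. This is a sketch under stated assumptions: the function names are illustrative, and the numerical values of $m_1, m_2, n$, the rank, $c_*$ and $C_*$ below are hypothetical, chosen only to exercise the formulas.

```python
import math

def lambda_37(c_star, C_star, m1, m2, n):
    """Regularization parameter of the form (3.7)."""
    m = m1 + m2
    return C_star * c_star * math.sqrt(math.log(m) / (min(m1, m2) * n))

def bound_39(c_star, C_star, m1, m2, n, rank):
    """Right-hand side of (3.9): bound on the normalized squared Frobenius error."""
    m, M = m1 + m2, max(m1, m2)
    k = ((1 + math.sqrt(2)) / 2) ** 2
    return k * (C_star * c_star) ** 2 * math.log(m) * M * rank / n

# Hypothetical values, for illustration only.
m1, m2, n, rank = 200, 1000, 100_000, 5
c_star, C_star = 1.0, 4.0
lam = lambda_37(c_star, C_star, m1, m2, n)
err = bound_39(c_star, C_star, m1, m2, n, rank)

# (3.9) is the rank term of (3.5) divided by m1*m2, evaluated at A = A_0.
k = ((1 + math.sqrt(2)) / 2) ** 2
rank_term_35 = k * m1 * m2 * lam ** 2 * rank
```

The bound scales like $M\,\mathrm{rank}(A_0)\log(m)/n$, so doubling $n$ halves it, while the dependence on the smaller dimension enters only through $\lambda$.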



4 Lower Bounds 

In this section, we prove the minimax lower bounds showing that the rates attained by
our estimator are optimal up to logarithmic factors. Note that here we cannot apply
the lower bounds of Theorem 6 in [23] for USR matrix completion on the Schatten balls
because they are achieved on matrices with entries that are not uniformly bounded
for $m_1 m_2 \gg n$.

We will need the following assumption, which is similar in spirit to but, in general,
substantially weaker than the usual Restricted Isometry condition.

Assumption 2 (Restricted Isometry in Expectation.) For some $1 \le r \le \min(m_1, m_2)$
and some $0 < \mu < \infty$ there exists a constant $\delta_r \in [0, 1)$ such that

$$(1 - \delta_r) \|A\|_2 \le \mu \|A\|_{L_2(\Pi)} \le (1 + \delta_r) \|A\|_2,$$

for all matrices $A \in \mathbb{R}^{m_1 \times m_2}$ with rank at most $r$.

For the particular case of fixed $X_i$ (cf. Example 4 in the Introduction), Assumption 2
coincides with the matrix version of scaled restricted isometry with scaling factor $\mu$ [23].

We will denote by $\inf_{\hat A}$ the infimum over all estimators $\hat A$ with values in $\mathbb{R}^{m_1 \times m_2}$.
For any integer $r \le \min(m_1, m_2)$ and any $a > 0$ we consider the class of matrices

$$\mathcal{A}(r, a) = \big\{ A_0 \in \mathbb{R}^{m_1 \times m_2} : \mathrm{rank}(A_0) \le r,\ \max_{i,j} |a_0(i, j)| \le a \big\}.$$

For any $A \in \mathbb{R}^{m_1 \times m_2}$, let $\mathbb{P}_A$ denote the probability distribution of the observations
$(X_1, Y_1, \dots, X_n, Y_n)$ with $\mathbb{E}(Y_i \mid X_i) = \langle A, X_i \rangle$. We set for brevity $M = \max(m_1, m_2)$.
Theorem 5 Fix $a > 0$ and an integer $1 \le r \le \min(m_1, m_2)$. Assume that $Mr \le n$, and
that conditionally on $X_i$, the variables $\xi_i$ are Gaussian $\mathcal{N}(0, \sigma^2)$, $\sigma^2 > 0$, for $i = 1, \dots, n$.
Let Assumption 2 be satisfied with some $\mu > 0$. Then

$$\inf_{\hat A} \sup_{A_0 \in \mathcal{A}(r, a)} \mathbb{P}_{A_0} \Big( \|\hat A - A_0\|_{L_2(\Pi)}^2 > c (\sigma \wedge a)^2 \frac{Mr}{n} \Big) \ge \beta, \tag{4.1}$$

where $\beta \in (0, 1)$ and $c > 0$ are absolute constants.



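PROOF. Without loss of generality, assume that $M = \max(m_1, m_2) = m_1 \ge m_2$. For
some constant $0 \le \gamma \le 1$ we define

$$\mathcal{C} = \Big\{ \tilde A = (a_{ij}) \in \mathbb{R}^{m_1 \times r} : a_{ij} \in \Big\{ 0,\ \gamma (\sigma \wedge a) \Big( \frac{\mu^2 r}{m_2 n} \Big)^{1/2} \Big\},\ \forall 1 \le i \le m_1,\ 1 \le j \le r \Big\},$$

and consider the associated set of block matrices

$$\mathcal{B}(\mathcal{C}) = \big\{ A = ( \tilde A \mid \cdots \mid \tilde A \mid O ) \in \mathbb{R}^{m_1 \times m_2} : \tilde A \in \mathcal{C} \big\},$$

where $O$ denotes the $m_1 \times (m_2 - r \lfloor m_2 / r \rfloor)$ zero matrix, and $\lfloor x \rfloor$ is the integer part of $x$.

By construction, any element of $\mathcal{B}(\mathcal{C})$ as well as the difference of any two elements
of $\mathcal{B}(\mathcal{C})$ has rank at most $r$ and the entries of any matrix in $\mathcal{B}(\mathcal{C})$ take values in $[0, a]$.
Thus, $\mathcal{B}(\mathcal{C}) \subset \mathcal{A}(r, a)$. Due to the Varshamov-Gilbert bound (cf. Lemma 2.9 in [26]),
there exists a subset $\mathcal{A}^0 \subset \mathcal{B}(\mathcal{C})$ with cardinality $\mathrm{Card}(\mathcal{A}^0) \ge 2^{r m_1 / 8} + 1$ containing the
zero $m_1 \times m_2$ matrix and such that, for any two distinct elements $A_1$ and $A_2$ of $\mathcal{A}^0$,

$$\|A_1 - A_2\|_2^2 \ge \frac{m_1 r}{8} \Big( \gamma^2 (\sigma \wedge a)^2 \frac{\mu^2 r}{m_2 n} \Big) \Big\lfloor \frac{m_2}{r} \Big\rfloor \ge \frac{\gamma^2}{16} (\sigma \wedge a)^2 \mu^2 \frac{m_1 r}{n}. \tag{4.2}$$

In view of Assumption 2, this implies

$$\|A_1 - A_2\|_{L_2(\Pi)}^2 \ge (1 - \delta_r)^2 \frac{\gamma^2}{16} (\sigma \wedge a)^2 \frac{m_1 r}{n}. \tag{4.3}$$

Using that, conditionally on $X_i$, the distributions of $\xi_i$ are Gaussian, we get that, for any
$A \in \mathcal{A}^0$, the Kullback-Leibler divergence $K(\mathbb{P}_0, \mathbb{P}_A)$ between $\mathbb{P}_0$ and $\mathbb{P}_A$ satisfies

$$K(\mathbb{P}_0, \mathbb{P}_A) = \frac{n}{2\sigma^2} \|A\|_{L_2(\Pi)}^2 \le (1 + \delta_r)^2 \frac{\gamma^2}{2} m_1 r. \tag{4.4}$$

From (4.4) we deduce that the condition

$$\frac{1}{\mathrm{Card}(\mathcal{A}^0) - 1} \sum_{A \in \mathcal{A}^0} K(\mathbb{P}_0, \mathbb{P}_A) \le \alpha \log\big( \mathrm{Card}(\mathcal{A}^0) - 1 \big) \tag{4.5}$$

is satisfied for any $\alpha > 0$ if $\gamma > 0$ is chosen as a sufficiently small numerical constant
depending on $\alpha$. In view of (4.3) and (4.5), the result now follows by application of
Theorem 2.5 in [26]. □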

In the USR matrix completion problem we have H^-H^m) = ( m l m 2) for all 

matrices A G ]g m i xm 2 Thus, the corresponding lower bound follows immediately from 
the previous theorem with 5 r = and \i = */mi 777,2- 

Theorem 6 Let the assumptions of Theorem 5 be satisfied, and let the matrices $X_i$ be
i.i.d. uniformly distributed on $\mathcal{X}$. Then

\[
\inf_{\hat A}\ \sup_{A_0\in\mathcal{A}(r,a)}\mathbb{P}_{A_0}\Big(\frac{1}{m_1m_2}\|\hat A-A_0\|_2^2>c(\sigma\wedge a)^2\,\frac{Mr}{n}\Big)\ \ge\ \beta, \tag{4.6}
\]

where $\beta\in(0,1)$ and $c>0$ are absolute constants.

Comparing Theorem 6 with Corollary 2(i) we see that, in the case of Gaussian
errors, the rate of convergence of our estimator $\hat A^\lambda$ given in (3.9) is optimal (up to a
logarithmic factor) in the minimax sense on the class of matrices $\mathcal{A}(r,a)$.

A similar conclusion can be obtained for the statistical learning setting. Indeed, assume
that the pairs $(X_i,Y_i)$ are i.i.d. realizations of a random pair $(X,Y)$ with distribution
$P_{XY}$ belonging to the class

\[
\mathcal{P}_{A_0,\eta}=\{P_{XY}:\ X\sim\Pi_0,\ |Y|\le\eta\ (\text{a.s.}),\ \mathbb{E}(Y|X)=\langle A_0,X\rangle\},
\]

where $\Pi_0$ is the uniform distribution on $\mathcal{X}$, $1\le r\le\min(m_1,m_2)$ is an integer, and
$\eta>0$.

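Theorem 7 Let $n,m_1,m_2,r$ be as in Theorem [?]. Let $(X_i,Y_i)$ be i.i.d. realizations of a
random pair $(X,Y)$ with distribution $P_{XY}$. Then

\[
\inf_{\hat A}\ \sup_{\mathrm{rank}(A_0)\le r}\ \sup_{P_{XY}\in\mathcal{P}_{A_0,\eta}}\mathbb{P}\Big(\frac{1}{m_1m_2}\|\hat A-A_0\|_2^2>c\,\eta^2\,\frac{Mr}{n}\Big)\ \ge\ \beta, \tag{4.7}
\]

where $\beta\in(0,1)$ and $c>0$ are absolute constants.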

PROOF. We act as in the proof of Theorem 5 with some modifications. Assuming that
$M=\max(m_1,m_2)=m_1\ge m_2$ and $0<\gamma<1/2$ we define the class of matrices

\[
\mathcal{C}=\Big\{\tilde A=(a_{ij})\in\mathbb{R}^{m_1\times r}:\ a_{ij}\in\Big\{0,\ \gamma\eta\Big(\frac{\mu^2 r}{m_2 n}\Big)^{1/2}\Big\},\ \forall\,1\le i\le m_1,\ 1\le j\le r\Big\},
\]

and take its block extension $\mathcal{B}(\mathcal{C})$. Consider the joint distributions $P_{XY}$ such that $X\sim\Pi_0$
and, conditionally on $X$, $Y=\eta$ with probability $p_{A_0}(X)=1/2+\langle A_0,X\rangle/(2\eta)$ and
$Y=-\eta$ with probability $1-p_{A_0}(X)=1/2-\langle A_0,X\rangle/(2\eta)$, where $A_0\in\mathcal{B}(\mathcal{C})$. It is
easy to see that such distributions $P_{XY}$ belong to the class $\mathcal{P}_{A_0,\eta}$, and our assumptions
guarantee that $1/4\le p_{A_0}(X)\le 3/4$, $\mathrm{rank}(A_0)\le r$ for all $A_0\in\mathcal{B}(\mathcal{C})$. We will denote
the corresponding $n$-product measure by $\mathbb{P}_{A_0}$. For any $A\in\mathcal{B}(\mathcal{C})$, the Kullback-Leibler
divergence between $\mathbb{P}_0$ and $\mathbb{P}_A$ has the form

\[
K(\mathbb{P}_0,\mathbb{P}_A)=n\,\mathbb{E}\Big(p_0(X)\log\frac{p_0(X)}{p_A(X)}+(1-p_0(X))\log\frac{1-p_0(X)}{1-p_A(X)}\Big). \tag{4.8}
\]

Using the inequality $-\log(1+u)\le -u+u^2/2$, $\forall\,u>-1$, and the fact that $1/4\le
p_A(X)\le 3/4$, we find that the expression under the expectation in (4.8) is bounded by
$2(p_0(X)-p_A(X))^2$. This implies

\[
K(\mathbb{P}_0,\mathbb{P}_A)\ \le\ \frac{n}{2\eta^2}\,\|A\|_{L_2(\Pi_0)}^2.
\]

The remaining arguments are identical to those in the proof of Theorem 5. □



5 Further results and examples 

5.1 Recovery of the rank and specific lower bound 

A notable property of the estimator $\hat A^\lambda$ in the matrix completion setting is that it has the
same rank as the underlying matrix $A_0$ with probability close to 1. As a consequence we
can establish a lower bound for the Frobenius error of $\hat A^\lambda$ with the rates matching, up to
constants, the upper bounds of Corollary 2.

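Theorem 8 Let $X_i$ be i.i.d. uniformly distributed on $\mathcal{X}$ and let $\lambda$ satisfy the inequality
$\lambda\ge 2\|M\|_\infty$ (as in Theorem [?]). Consider the estimator $\hat A^{\lambda'}$ with $\lambda'=\lambda/(1-\delta)$ for some
$0<\delta<1$. Set $\hat r=\mathrm{rank}(\hat A^{\lambda'})$. Then

\[
\hat r\le\mathrm{rank}(A_0). \tag{5.1}
\]

If, in addition, $\min_{j:\,\sigma_j(A_0)\ne 0}\sigma_j(A_0)>\lambda' m_1m_2$, then

\[
\hat r\ge\mathrm{rank}(A_0), \tag{5.2}
\]

and

\[
\|\hat A^{\lambda'}-A_0\|_2^2\ \ge\ \frac{\delta^2}{4}\,\mathrm{rank}(A_0)\,(\lambda' m_1m_2)^2. \tag{5.3}
\]

PROOF. Note that $\mathbf{X}-A_0=m_1m_2M$. Using a standard matrix perturbation argument
(cf. [21], page 203), we get, for all $j=1,\dots,m_1\wedge m_2$,

\[
|\sigma_j(\mathbf{X})-\sigma_j(A_0)|\ \le\ \sigma_1(\mathbf{X}-A_0)=m_1m_2\|M\|_\infty\ \le\ \frac{\lambda m_1m_2}{2}=(1-\delta)\frac{\lambda' m_1m_2}{2}.
\]

Since, by (3.2), $\sigma_{\hat r}(\mathbf{X})>\lambda' m_1m_2/2$, we find that $\sigma_{\hat r}(A_0)>\delta\lambda' m_1m_2/2$. This implies
(5.1). Now, if $\sigma_j(A_0)>\lambda' m_1m_2$ we get

\[
\sigma_j(\mathbf{X})\ \ge\ \sigma_j(A_0)-|\sigma_j(\mathbf{X})-\sigma_j(A_0)|\ >\ \lambda' m_1m_2-(1-\delta)\frac{\lambda' m_1m_2}{2}\ >\ \frac{\lambda' m_1m_2}{2},
\]

and thus (5.2) follows.

To prove (5.3), denote by $\mathcal{P}:\mathbb{R}^{m_1\times m_2}\to\mathbb{R}^{m_1\times m_2}$ the projector on the linear span
of matrices $(u_j(\mathbf{X})v_j(\mathbf{X})^T,\ j=1,\dots,r)$, where $r=\mathrm{rank}(A_0)$. We have $\|\hat A^{\lambda'}-A_0\|_2\ge
\|\mathcal{P}(\hat A^{\lambda'}-A_0)\|_2\ge\|\mathcal{P}(\hat A^{\lambda'}-\mathbf{X})\|_2-\|\mathcal{P}(\mathbf{X}-A_0)\|_2$. Here $\|\mathcal{P}(\hat A^{\lambda'}-\mathbf{X})\|_2=\sqrt r\,\lambda' m_1m_2/2$
in view of (3.2) and the fact that $\hat r=r$, cf. (5.1) and (5.2). On the other hand, $\|\mathcal{P}(\mathbf{X}-
A_0)\|_2\le\sqrt r\,\|M\|_\infty m_1m_2\le\sqrt r\,\lambda m_1m_2/2$. This implies

\[
\|\hat A^{\lambda'}-A_0\|_2\ \ge\ \sqrt r\,\Big(\frac{\lambda' m_1m_2}{2}-(1-\delta)\frac{\lambda' m_1m_2}{2}\Big)=\delta\sqrt r\,\frac{\lambda' m_1m_2}{2}.
\]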

Corollary 3 Let the assumptions of Corollary [?] be satisfied. Consider the estimator $\hat A^{\lambda'}$
with

\[
\lambda'=\frac{C^*c^*}{1-\delta}\sqrt{\frac{\log(m)}{(m_1\wedge m_2)\,n}}
\]

for some $0<\delta<1$. Set $\hat r=\mathrm{rank}(\hat A^{\lambda'})$. Then $\hat r\le\mathrm{rank}(A_0)$ with probability at least
$1-3/(m_1+m_2)$. If, in addition,

\[
\min_{j:\,\sigma_j(A_0)\ne 0}\sigma_j(A_0)\ >\ \frac{C^*c^*}{1-\delta}\,\sqrt{m_1m_2}\,\sqrt{\frac{\log(m)(m_1\vee m_2)}{n}}, \tag{5.4}
\]

then $\hat r\ge\mathrm{rank}(A_0)$ and

\[
\frac{1}{m_1m_2}\|\hat A^{\lambda'}-A_0\|_2^2\ \ge\ \frac{\delta^2(C^*c^*)^2}{4(1-\delta)^2}\,\frac{\log(m)(m_1\vee m_2)}{n}\,\mathrm{rank}(A_0), \tag{5.5}
\]

with the same probability.

We note that the lower bound for $\sigma_j(A_0)$ in (5.4) is not excessively high, since $\sqrt{m_1m_2}$
is a "typical" order of the largest singular value $\sigma_1(A_0)$ for non-lacunary matrices $A_0$.
For example, if all the entries of $A_0$ are equal to some constant $a$, the left hand side of
(5.4) is equal to $\sigma_1(A_0)=a\sqrt{m_1m_2}$.
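
The constant-matrix example can be checked numerically. The following is a small sketch, not part of the original argument; the dimensions `m1`, `m2` and the constant `a` are hypothetical values chosen only for illustration. It confirms that such a matrix has a single nonzero singular value, equal to $a\sqrt{m_1m_2}$.

```python
import numpy as np

# Hypothetical dimensions and constant, chosen only for illustration.
m1, m2, a = 6, 4, 0.5

A0 = np.full((m1, m2), a)                        # every entry equals a
sing_vals = np.linalg.svd(A0, compute_uv=False)  # sorted in decreasing order

sigma_1 = sing_vals[0]                           # largest singular value
predicted = a * np.sqrt(m1 * m2)                 # a * sqrt(m1 * m2)
num_nonzero = int(np.sum(sing_vals > 1e-10))     # number of nonzero singular values
```

Since the matrix has rank one, the minimum over nonzero singular values in the condition coincides with the largest singular value.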
5.2 Risk bounds in statistical learning

The results of the previous sections can also be extended to the traditional statistical
learning setting where $(X_i,Y_i)$ is a sequence of i.i.d. replications of a random pair $(X,Y)$
with $X\in\mathbb{R}^{m_1\times m_2}$ and $Y\in\mathbb{R}$, and there is no underlying model determined by matrix
$A_0$, i.e., we do not assume that $\mathbb{E}(Y|X)=\langle A_0,X\rangle$. Then the above oracle inequalities
can be reformulated in terms of the prediction risk

\[
R(A)=\mathbb{E}\big[(Y-\langle A,X\rangle)^2\big],\quad\forall A\in\mathbb{R}^{m_1\times m_2}.
\]

We illustrate this by an example dealing with USR matrix completion. Specifically,
Theorem [?] is reformulated in the following way.

Theorem 9 Let $X_i$ be i.i.d. uniformly distributed on $\mathcal{X}$. Assume that $|Y|\le\eta$ almost
surely for some constant $\eta$. For $t>0$ consider the regularization parameter $\lambda$ satisfying
(3.6). Then with probability at least $1-e^{-t}$ we have

\[
R(\hat A^\lambda)\ \le\ R(A)+\min\Big\{2\lambda\|A\|_1,\ \Big(\frac{1+\sqrt2}{2}\Big)^2 m_1m_2\lambda^2\,\mathrm{rank}(A)\Big\} \tag{5.6}
\]

for all $A\in\mathbb{R}^{m_1\times m_2}$. In particular, under the assumptions of Corollary 2(ii),

\[
R(\hat A^\lambda)\ \le\ \min_{A\in\mathbb{R}^{m_1\times m_2}}\Big(R(A)+4(1+\sqrt2)^2\eta^2\log(m)\,\frac{M\,\mathrm{rank}(A)}{n}\Big). \tag{5.7}
\]

This theorem can be also viewed as a result about the approximate sparsity. We do not 
know whether the true underlying model is described by some matrix Aq, but we can 
guarantee that our estimator is not far from the best approximation provided by matrices 
A with small rank or small nuclear norm. 

Note that the results of Theorem 9 are uniform over the class of distributions

\[
\mathcal{P}_\eta=\{P_{XY}:\ X\sim\Pi_0,\ |Y|\le\eta\ (\text{a.s.})\},
\]

where $\Pi_0$ is the uniform distribution on $\mathcal{X}$, and $\eta>0$ is a constant. The corresponding
lower bound is given in the next theorem.

Theorem 10 Let $n,m_1,m_2,r$ be as in Theorem [?]. Let $(X_i,Y_i)$ be i.i.d. realizations of a
random pair $(X,Y)$ with distribution $P_{XY}$. Then

\[
\inf_{\hat A}\ \sup_{\mathrm{rank}(A)\le r}\ \sup_{P_{XY}\in\mathcal{P}_\eta}\mathbb{P}\Big(R(\hat A)>R(A)+c\,\eta^2\,\frac{Mr}{n}\Big)\ \ge\ \beta, \tag{5.8}
\]

where $\beta\in(0,1)$ and $c>0$ are absolute constants.

PROOF. For $\mathbb{E}(Y|X)=\langle A_0,X\rangle$ we have $R(A)=\|A-A_0\|_{L_2(\Pi)}^2+\sigma^2=(m_1m_2)^{-1}\|A-
A_0\|_2^2+\sigma^2$, where $\sigma^2=\mathbb{E}[(Y-\mathbb{E}(Y|X))^2]$. Thus, using Theorem 7 we get

\[
\sup_{\mathrm{rank}(A)\le r}\ \sup_{P_{XY}\in\mathcal{P}_\eta}\mathbb{P}\Big(R(\hat A)>R(A)+c\,\eta^2\,\frac{Mr}{n}\Big)
\ \ge\ \sup_{\mathrm{rank}(A)\le r}\ \sup_{P_{XY}\in\mathcal{P}_{A,\eta}}\mathbb{P}\Big(\frac{1}{m_1m_2}\|\hat A-A\|_2^2>c\,\eta^2\,\frac{Mr}{n}\Big)\ \ge\ \beta.
\]

Inequalities (5.7) and (5.8) imply minimax rate optimality of $\hat A^\lambda$ up to a logarithmic
factor in the statistical learning setting.

5.3 Sharp oracle inequality for the Lasso 

As we already mentioned in Example 4 and in the remark after Theorem 2, one can exploit
(2.19) to derive a sharp (i.e., with the leading constant 1) sparsity oracle inequality
for the Lasso. Indeed, if $m_1=m_2=p$ and $A$ and $X_i$ are diagonal matrices, then the
trace regression model (1.2) becomes

\[
Y_i=x_i^T\beta^*+\xi_i,\quad i=1,\dots,n,
\]

where $x_i,\beta^*\in\mathbb{R}^p$ denote the vectors of diagonal elements of $X_i,A_0$, respectively. Set
$X=(x_1,\dots,x_n)^T\in\mathbb{R}^{n\times p}$ to be the design matrix of this linear regression model. For
a vector $z=(z^{(1)},\dots,z^{(d)})\in\mathbb{R}^d$, define $|z|_q=\big(\sum_{j=1}^d|z^{(j)}|^q\big)^{1/q}$ for $1\le q<\infty$ and
$|z|_\infty=\max_{1\le j\le d}|z^{(j)}|$.

Assume in what follows that $x_i$ are fixed. Then for $A=\mathrm{diag}(\beta)$ we have $\|A\|_{L_2(\Pi)}^2=
n^{-1}|X\beta|_2^2$, where $\mathrm{diag}(\beta)$ denotes the diagonal $p\times p$ matrix with the components of $\beta$
on the diagonal. We will assume without loss of generality that the diagonal elements of
the Gram matrix $\frac1n X^TX$ are not larger than 1 (the general case is obtained from this by
simple rescaling).
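
The reduction from the trace regression model to linear regression can be illustrated numerically. The sketch below is illustrative only; the sizes `n`, `p` and the random draws are hypothetical. It checks that $\mathrm{tr}(X_i^TA)=x_i^T\beta$ when $X_i=\mathrm{diag}(x_i)$ and $A=\mathrm{diag}(\beta)$, and that the fixed-design seminorm equals $n^{-1}|X\beta|_2^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 3                       # hypothetical sample size and dimension
x = rng.normal(size=(n, p))       # row i holds the vector x_i
beta = rng.normal(size=p)

# tr(X_i^T A) with X_i = diag(x_i) and A = diag(beta)
trace_form = np.array(
    [np.trace(np.diag(x[i]).T @ np.diag(beta)) for i in range(n)]
)
linear_form = x @ beta            # x_i^T beta for every i

# Fixed design: ||A||_{L2(Pi)}^2 = n^{-1} * sum_i <A, X_i>^2
l2_pi_sq = np.mean(trace_form ** 2)
design_norm_sq = np.sum((x @ beta) ** 2) / n   # n^{-1} |X beta|_2^2
```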

The estimator $\hat A^\lambda$ defined in (1.7) becomes the usual Lasso estimator

\[
\hat\beta^\lambda=\arg\min_{\beta\in\mathbb{R}^p}\Big\{\frac1n\sum_{i=1}^n(Y_i-x_i^T\beta)^2+\lambda|\beta|_1\Big\}.
\]

For a vector $\beta\in\mathbb{R}^p$, we set, with a little abuse of notation, $\mu_{c_0}(\beta)=\mu_{c_0}(\mathrm{diag}(\beta))$,
$\mu(\beta)=\mu_5(\beta)$. Let $M(\beta)$ denote the number of nonzero components of $\beta$.
For simplicity, the result is stated only in the case of Gaussian noise.

Theorem 11 Let $\xi_i$ be i.i.d. Gaussian $\mathcal{N}(0,\sigma^2)$ and let the diagonal elements of matrix
$\frac1n X^TX$ be not larger than 1. Take

\[
\lambda=C\sigma\sqrt{\frac{\log p}{n}},
\]

where $C=3a\sqrt2$, $a>1$. Then, with probability at least $1-\dfrac{p^{1-a^2}}{a\sqrt{\pi\log p}}$,

\[
\frac1n|X(\hat\beta^\lambda-\beta^*)|_2^2\ \le\ \inf_{\beta\in\mathbb{R}^p}\Big\{\frac1n|X(\beta-\beta^*)|_2^2+C^2\sigma^2\mu^2(\beta)M(\beta)\,\frac{\log p}{n}\Big\}. \tag{5.9}
\]

PROOF. Combine Theorem 2 and a standard bound on the tail of the Gaussian distribution,
which assures that with probability at least $1-\dfrac{p^{1-a^2}}{a\sqrt{\pi\log p}}$,

\[
\|M\|_\infty=\Big|\frac1n\sum_{i=1}^n\xi_ix_i\Big|_\infty\ \le\ a\sigma\sqrt{\frac{2\log p}{n}}.
\]

Given $\beta\in\mathbb{R}^p$ and $J\subset\{1,\dots,p\}$, denote by $\beta_J$ the vector in $\mathbb{R}^p$ which has the same
coordinates as $\beta$ on $J$ and zero coordinates on the complement $J^c$ of $J$.
We recall the Restricted Eigenvalue condition of [6]:

Condition RE($s,c_0$). For some integer $s$ such that $1\le s\le p$, and a positive
number $c_0$ the following condition holds:

\[
\kappa(s,c_0)=\min_{\substack{J\subset\{1,\dots,p\},\\ |J|\le s}}\ \min_{\substack{u\in\mathbb{R}^p,\ u\ne 0,\\ |u_{J^c}|_1\le c_0|u_J|_1}}\ \frac{|Xu|_2}{\sqrt n\,|u_J|_2}\ >\ 0.
\]



We have the following corollary. 

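Corollary 4 Let the assumptions of Theorem 11 hold, and let condition RE($s,5$) be
satisfied for some $1\le s\le p$. Then, with probability at least $1-\dfrac{p^{1-a^2}}{a\sqrt{\pi\log p}}$,

\[
\frac1n|X(\hat\beta^\lambda-\beta^*)|_2^2\ \le\ \inf_{\beta\in\mathbb{R}^p:\,M(\beta)\le s}\Big\{\frac1n|X(\beta-\beta^*)|_2^2+\frac{C^2\sigma^2}{\kappa^2(s,5)}\,\frac{M(\beta)\log p}{n}\Big\}. \tag{5.10}
\]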

PROOF. Recall that $e_j(p)$ denote the canonical basis vectors of $\mathbb{R}^p$. For any $p\times p$ diagonal
matrix $A$ with support $(S_1,S_2)$, $S_1=S_2=\{e_j(p),\ j\in J\}$, where $J\subset\{1,\dots,p\}$ has
cardinality $|J|\le s$, and an arbitrary $p\times p$ diagonal matrix $B=\mathrm{diag}(u)$, where $u\in\mathbb{R}^p$,
we have

\[
\|\mathcal{P}_A(B)\|_1=|u_J|_1,\qquad\|\mathcal{P}_A^{\perp}(B)\|_1=|u_{J^c}|_1,
\]

and

\[
\mathcal{C}_{A,c_0}=\{\mathrm{diag}(u):\ u\in\mathbb{R}^p,\ |u_{J^c}|_1\le c_0|u_J|_1\}.
\]

Thus

\[
\kappa(s,c_0)=\inf_{u\in\mathbb{R}^p:\ u\ne 0,\ M(u)\le s}\ \frac{1}{\mu_{c_0}(\mathrm{diag}(u))}.
\]

Since Condition RE($s,5$) is satisfied, Theorem 11 yields the result. □
Remark. Oracle inequalities (5.9) and (5.10) can be straightforwardly extended
to the model $Y_i=f_i+\xi_i$, $i=1,\dots,n$, where $f_i$ are arbitrary fixed values and not
necessarily $f_i=x_i^T\beta^*$. This setting is interesting in the context of aggregation. Then
the $x_i$ are vectors of values of some given dictionary of $p$ functions at $n$ given points
and $f_i$ are the values of an unknown regression function at the same points. Under this
model, inequalities (5.9) and (5.10) hold true with the only difference that $X\beta^*$ should
be replaced by the vector $f=(f_1,\dots,f_n)^T$. With such a modification, (5.10) improves
upon Theorem 6.1 of [6] where the leading constant is larger than 1.



6 Control of the stochastic error 

In this section, we obtain the probability inequalities for the stochastic error $\|M\|_\infty$. For
brevity, we will write throughout $\|\cdot\|_\infty=\|\cdot\|$. The following proposition is an immediate
consequence of the matrix version of Bernstein's inequality (Corollary 9.1 in [25]).



Proposition 1 Let $Z_1,\dots,Z_n$ be independent random matrices with dimensions $m_1\times
m_2$ that satisfy $\mathbb{E}(Z_i)=0$ and $\|Z_i\|\le U$ almost surely for some constant $U$ and all
$i=1,\dots,n$. Define

\[
\sigma_Z=\max\Big\{\Big\|\frac1n\sum_{i=1}^n\mathbb{E}(Z_iZ_i^T)\Big\|^{1/2},\ \Big\|\frac1n\sum_{i=1}^n\mathbb{E}(Z_i^TZ_i)\Big\|^{1/2}\Big\}.
\]

Then, for all $t>0$, with probability at least $1-e^{-t}$ we have

\[
\Big\|\frac{Z_1+\cdots+Z_n}{n}\Big\|\ \le\ 2\max\Big\{\sigma_Z\sqrt{\frac{t+\log(m)}{n}},\ U\,\frac{t+\log(m)}{n}\Big\},
\]

where $m=m_1+m_2$.

Furthermore, it is possible to replace the $L_\infty$-bound $U$ on $\|Z\|$ in the above inequality
by bounds on the weaker $\psi_\alpha$-norms of $\|Z\|$ defined by

\[
U_Z^{(\alpha)}=\inf\big\{u>0:\ \mathbb{E}\exp(\|Z\|^\alpha/u^\alpha)\le 2\big\},\quad\alpha\ge 1.
\]

Proposition 2 Let $Z,Z_1,\dots,Z_n$ be i.i.d. random matrices with dimensions $m_1\times m_2$
that satisfy $\mathbb{E}(Z)=0$. Suppose that $U_Z^{(\alpha)}<\infty$ for some $\alpha\ge 1$. Then there exists a
constant $C>0$ such that, for all $t>0$, with probability at least $1-e^{-t}$

\[
\Big\|\frac{Z_1+\cdots+Z_n}{n}\Big\|\ \le\ C\max\Big\{\sigma_Z\sqrt{\frac{t+\log(m)}{n}},\ U_Z^{(\alpha)}\Big(\log\frac{U_Z^{(\alpha)}}{\sigma_Z}\Big)^{1/\alpha}\frac{t+\log(m)}{n}\Big\},
\]

where $m=m_1+m_2$.

This is an easy consequence of Proposition 2 in [17], which provides an analogous result
for Hermitian matrices $Z$. Its extension to rectangular matrices stated in Proposition 2
is straightforward via the self-adjoint dilation, cf., for example, the proof of Corollary 9.1
in [25].

The next lemma gives a control of the stochastic error for USR matrix completion
in the statistical learning setting.

Lemma 1 Let $X_i$ be i.i.d. uniformly distributed on $\mathcal{X}$. Assume that $\max_{i=1,\dots,n}|Y_i|\le\eta$
almost surely for some constant $\eta$. Then for any $t>0$ with probability at least $1-e^{-t}$
we have

\[
\|M\|\ \le\ 2\eta\max\Big\{\sqrt{\frac{t+\log(m)}{(m_1\wedge m_2)\,n}},\ \frac{2(t+\log(m))}{n}\Big\}. \tag{6.1}
\]

PROOF. We apply Proposition 1 with $Z_i=Y_iX_i-\mathbb{E}(Y_iX_i)$. Recall that here $X_i$ are i.i.d.
with the same distribution as $X$ and $Y_i$ are not necessarily i.i.d. Observe that

\[
\|X\|=1,\qquad\|\mathbb{E}(X)\|=\sqrt{\frac{1}{m_1m_2}},\qquad\sigma_X^2=\frac{1}{m_1\wedge m_2}. \tag{6.2}
\]

Therefore, $\|Z_i\|\le 2\eta$, $\sigma_Z\le\eta\,\sigma_X$, and the result follows from Proposition 1. □
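
The three quantities used in this proof can be computed exactly for a small case by enumerating the matrix completion basis. The sketch below is illustrative only: the dimensions are hypothetical, and it assumes that $\sigma_X^2$ is understood as $\max\{\|\mathbb{E}(XX^T)\|,\|\mathbb{E}(X^TX)\|\}$, in line with the definition of $\sigma_Z$ in Proposition 1.

```python
import numpy as np

m1, m2 = 3, 5                                   # hypothetical dimensions

# Enumerate the basis {e_j e_k^T}; X is uniform on this set.
basis = []
for j in range(m1):
    for k in range(m2):
        E = np.zeros((m1, m2))
        E[j, k] = 1.0
        basis.append(E)

def spec(B):
    """Spectral norm (largest singular value)."""
    return np.linalg.norm(B, 2)

norm_X = max(spec(E) for E in basis)            # ||X|| for every basis element
norm_EX = spec(sum(basis) / (m1 * m2))          # ||E(X)||
EXXt = sum(E @ E.T for E in basis) / (m1 * m2)  # E(X X^T) = I / m1
EXtX = sum(E.T @ E for E in basis) / (m1 * m2)  # E(X^T X) = I / m2
sigma_X_sq = max(spec(EXXt), spec(EXtX))        # assumed definition of sigma_X^2
```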
We now consider the USR matrix completion with sub-exponential errors. Recall
that in this case we assume that the pairs $(X_i,Y_i)$ are i.i.d. We have

\[
\|M\|\ \le\ \Big\|\frac1n\sum_{i=1}^n\big(\xi_iX_i-\mathbb{E}(\xi_iX_i)\big)\Big\|+\Big\|\frac1n\sum_{i=1}^n\big(\mathrm{tr}(A_0^TX_i)X_i-\mathbb{E}(\mathrm{tr}(A_0^TX_i)X_i)\big)\Big\|=:\Delta_1+\Delta_2.
\]

We treat the terms $\Delta_1$ and $\Delta_2$ separately in the two lemmas below.

Lemma 2 Let $X_i$ be i.i.d. uniformly distributed on $\mathcal{X}$, and the pairs $(X_i,Y_i)$ be i.i.d.
Assume that condition (3.3) holds. Then there exists an absolute constant $C>0$ that
can depend only on $\alpha,c_1,\tilde c$ and such that, for all $t>0$, with probability at least $1-2e^{-t}$
we have

\[
\Delta_1\ \le\ C\sigma\max\Big\{\sqrt{\frac{t+\log(m)}{(m_1\wedge m_2)\,n}},\ \frac{(t+\log(m))\log^{1/\alpha}(m_1\wedge m_2)}{n}\Big\}. \tag{6.3}
\]



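PROOF. Observe first that for $\bar X=X-\mathbb{E}(X)$ we have

\[
\sigma_{\bar X}^2=\frac{1}{m_1\wedge m_2}. \tag{6.4}
\]

Now,

\[
\Delta_1\ \le\ \Big\|\frac1n\sum_{i=1}^n\xi_i(X_i-\mathbb{E}X_i)\Big\|+\Big\|\frac1n\sum_{i=1}^n\xi_i\,\mathbb{E}X_i\Big\|
\ \le\ \Big\|\frac1n\sum_{i=1}^n\xi_i(X_i-\mathbb{E}X_i)\Big\|+\frac{1}{\sqrt{m_1m_2}}\Big|\frac1n\sum_{i=1}^n\xi_i\Big|. \tag{6.5}
\]

Set $Z_i=\xi_i(X_i-\mathbb{E}X)$. These are i.i.d. random matrices having the same distribution
as a random matrix $Z$. It follows from (6.2) that $\|Z_i\|\le 2|\xi_i|$, and thus condition (3.3)
implies that $U_Z^{(\alpha)}\le c\sigma$ for some constant $c>0$. Furthermore, in view of (6.4), we have
$\sigma_Z\le c'\sigma\sigma_{\bar X}=c'\sigma/(m_1\wedge m_2)^{1/2}$ for some constant $c'>0$ and $\sigma_Z\ge c_1^{1/2}\sigma/(2(m_1\wedge m_2))^{1/2}$.
Using these remarks we can deduce from Proposition 2 that there exists an absolute
constant $\tilde C>0$ such that for any $t>0$ with probability at least $1-e^{-t}$ we have

\[
\Big\|\frac1n\sum_{i=1}^n\xi_i(X_i-\mathbb{E}X)\Big\|\ \le\ \tilde C\max\Big\{\sigma_Z\sqrt{\frac{t+\log(m)}{n}},\ U_Z^{(\alpha)}\Big(\log\frac{U_Z^{(\alpha)}}{\sigma_Z}\Big)^{1/\alpha}\frac{t+\log(m)}{n}\Big\}
\]
\[
\le\ C\sigma\max\Big\{\sqrt{\frac{t+\log(m)}{(m_1\wedge m_2)\,n}},\ \frac{(t+\log(m))\log^{1/\alpha}(m_1\wedge m_2)}{n}\Big\}.
\]

Finally, in view of Condition (3.3) and Bernstein's inequality for sub-exponential noise,
we have for any $t>0$, with probability at least $1-e^{-t}$,

\[
\Big|\frac1n\sum_{i=1}^n\xi_i\Big|\ \le\ C\sigma\max\Big\{\sqrt{\frac tn},\ \frac tn\Big\},
\]

where $C>0$ depends only on $\tilde c$. We complete the proof by using the union bound. □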
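Define now

\[
|A_0|_*=\max\Big\{\max_{1\le i\le m_1}\Big(\sum_{j=1}^{m_2}a_0^2(i,j)\Big)^{1/2},\ \max_{1\le j\le m_2}\Big(\sum_{i=1}^{m_1}a_0^2(i,j)\Big)^{1/2}\Big\}\ \le\ \max_{i,j}|a_0(i,j)|\,\sqrt{m_1\vee m_2}.
\]

Lemma 3 Let $X_i$ be i.i.d. random variables uniformly distributed in $\mathcal{X}$. Then, for all
$t>0$, with probability at least $1-e^{-t}$ we have

\[
\Delta_2\ \le\ 2\max\Big\{|A_0|_*\sqrt{\frac{t+\log(m)}{m_1m_2\,n}},\ 2\max_{i,j}|a_0(i,j)|\,\frac{t+\log(m)}{n}\Big\}. \tag{6.6}
\]

If $\max_{i,j}|a_0(i,j)|\le a$ for some $a>0$, then with the same probability

\[
\Delta_2\ \le\ 2a\max\Big\{\sqrt{\frac{t+\log(m)}{(m_1\wedge m_2)\,n}},\ \frac{2(t+\log(m))}{n}\Big\}.
\]

PROOF. We apply Proposition 1 for the random variables $Z_i=\mathrm{tr}(A_0^TX_i)X_i-\mathbb{E}(\mathrm{tr}(A_0^TX)X)$.
Using (6.2) we get $\|Z_i\|\le 2\max_{i,j}|a_0(i,j)|$ and

\[
\sigma_Z^2\ \le\ \max\big\{\|\mathbb{E}(\langle A_0,X\rangle^2XX^T)\|,\ \|\mathbb{E}(\langle A_0,X\rangle^2X^TX)\|\big\}\ \le\ \frac{|A_0|_*^2}{m_1m_2}.
\]

Thus, (6.6) follows from Proposition 1. □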
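
The variance term in this proof can be evaluated exactly for a small example. The sketch below is illustrative and not part of the proof; the matrix `A0` and its dimensions are hypothetical. It computes $\mathbb{E}(\langle A_0,X\rangle^2XX^T)$ and $\mathbb{E}(\langle A_0,X\rangle^2X^TX)$ by enumerating the completion basis and compares their spectral norms with the largest squared row or column norm of $A_0$ divided by $m_1m_2$.

```python
import numpy as np

rng = np.random.default_rng(1)
m1, m2 = 4, 6                                   # hypothetical dimensions
A0 = rng.uniform(-1.0, 1.0, size=(m1, m2))      # hypothetical target matrix

left = np.zeros((m1, m1))                       # E(<A0,X>^2 X X^T)
right = np.zeros((m2, m2))                      # E(<A0,X>^2 X^T X)
for j in range(m1):
    for k in range(m2):
        X = np.zeros((m1, m2))
        X[j, k] = 1.0
        w = np.sum(A0 * X) ** 2 / (m1 * m2)     # <A0,X>^2 times P(X = e_j e_k^T)
        left += w * (X @ X.T)
        right += w * (X.T @ X)

variance_proxy = max(np.linalg.norm(left, 2), np.linalg.norm(right, 2))

# Largest squared row/column Euclidean norm of A0, divided by m1*m2.
row_col_max = max((A0 ** 2).sum(axis=1).max(), (A0 ** 2).sum(axis=0).max()) / (m1 * m2)

# Cruder bound in terms of the largest entry of A0.
crude = np.abs(A0).max() ** 2 * max(m1, m2) / (m1 * m2)
```

For this sampling scheme both expectations are diagonal matrices, which is why the spectral norms reduce to maxima over rows and columns.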
NUCLEAR-NORM PENALIZATION AND OPTIMAL
RATES FOR NOISY LOW-RANK MATRIX COMPLETION

By Vladimir Koltchinskii, Karim Lounici and Alexandre B. Tsybakov

Georgia Institute of Technology and CREST

Abstract This paper deals with the trace regression model where $n$ entries or linear combinations of entries of an unknown $m_1\times m_2$ matrix $A_0$ corrupted by noise are observed. We propose a new nuclear-norm penalized estimator of $A_0$ and establish a general sharp oracle inequality for this estimator for arbitrary values of $n,m_1,m_2$ under the condition of isometry in expectation. Then this method is applied to the matrix completion problem. In this case, the estimator admits a simple explicit form and we prove that it satisfies oracle inequalities with faster rates of convergence than in the previous works. They are valid, in particular, in the high-dimensional setting $m_1m_2\gg n$. We show that the obtained rates are optimal up to logarithmic factors in a minimax sense and also derive, for any fixed matrix $A_0$, a non-minimax lower bound on the rate of convergence of our estimator, which coincides with the upper bound up to a constant factor. Finally, we show that our procedure provides an exact recovery of the rank of $A_0$ with probability close to 1. We also discuss the statistical learning setting where there is no underlying model determined by $A_0$ and the aim is to find the best trace regression model approximating the data. As a by-product, we show that, under the Restricted Eigenvalue condition, the usual vector Lasso estimator satisfies a sharp oracle inequality (i.e., an oracle inequality with leading constant 1).



Supported in part by ANR "Parcimonie" and by PASCAL-2 Network of Excellence.
AMS 2000 subject classifications: Primary 62J99, 62H12; secondary 60B20, 60G15.
Keywords and phrases: matrix completion, low-rank matrix estimation, recovery of the rank, statistical learning, optimal rate of convergence, noncommutative Bernstein inequality, Lasso.

1. Introduction. Assume that we observe $n$ independent random pairs $(X_i,Y_i)$, $i=1,\dots,n$, where $X_i$ are random matrices with dimensions $m_1\times m_2$ and $Y_i$ are random variables in $\mathbb{R}$, satisfying the trace regression model:

\[
\mathbb{E}(Y_i|X_i)=\mathrm{tr}(X_i^TA_0),\quad i=1,\dots,n, \tag{1.1}
\]

where $A_0\in\mathbb{R}^{m_1\times m_2}$ is an unknown matrix, $\mathbb{E}(Y_i|X_i)$ is the conditional expectation of $Y_i$ given $X_i$, and $\mathrm{tr}(B)$ denotes the trace of matrix $B$. We consider the problem of estimation of $A_0$ based on the observations $(X_i,Y_i)$, $i=1,\dots,n$. Though the results of this paper are obtained for general $n,m_1,m_2$, the main motivation is in the high-dimensional setting, which corresponds to $m_1m_2\gg n$, with low-rank matrices $A_0$.

It will be convenient to write the model (1.1) in the form

\[
Y_i=\mathrm{tr}(X_i^TA_0)+\xi_i,\quad i=1,\dots,n, \tag{1.2}
\]

where the noise variables $\xi_i=Y_i-\mathbb{E}(Y_i|X_i)$ are independent and have zero means.

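For any matrices $A,B\in\mathbb{R}^{m_1\times m_2}$, we define the scalar product

\[
\langle A,B\rangle=\mathrm{tr}(A^TB)
\]

and the bilinear form

\[
\langle A,B\rangle_{L_2(\Pi)}=\frac1n\sum_{i=1}^n\mathbb{E}\big(\langle A,X_i\rangle\langle B,X_i\rangle\big).
\]

Here $\Pi=\frac1n\sum_{i=1}^n\Pi_i$, where $\Pi_i$ denotes the distribution of $X_i$. The corresponding semi-norm $\|A\|_{L_2(\Pi)}$ is given by

\[
\|A\|_{L_2(\Pi)}^2=\frac1n\sum_{i=1}^n\mathbb{E}\big(\langle A,X_i\rangle^2\big).
\]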

Example 1. Matrix Completion. Assume that the design matrices $X_i$ are i.i.d. uniformly distributed on the set

\[
\mathcal{X}=\big\{e_j(m_1)e_k^T(m_2),\ 1\le j\le m_1,\ 1\le k\le m_2\big\}, \tag{1.3}
\]

where $e_k(m)$ are the canonical basis vectors in $\mathbb{R}^m$. The set $\mathcal{X}$ forms an orthonormal basis in the space of $m_1\times m_2$ matrices that will be called the matrix completion basis. Let also $n<m_1m_2$. Then the problem of estimation of $A_0$ coincides with the problem of matrix completion under uniform sampling at random (USR) as studied in the non-noisy case ($\xi_i=0$) in [15, 22], and in the noisy case in [14, 24]. Considering low-rank matrices $A_0$ is of a particular interest. Clearly, for such $X_i$ we have the isometry

\[
\|A\|_{L_2(\Pi)}^2=\mu^{-2}\|A\|_2^2 \tag{1.4}
\]

for all matrices $A\in\mathbb{R}^{m_1\times m_2}$, where $\mu=\sqrt{m_1m_2}$, and $\|A\|_2$ is the Frobenius norm of $A$. However, the restricted isometry property in the usual sense, i.e., "in probability", cf., e.g., [23], does not hold for matrix completion, since for $n<m_1m_2$ there trivially exists a matrix of rank 1 in the null space of the sampling operator.
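
Both observations in this example can be checked on a toy case. The sketch below is illustrative only; the dimensions, the random matrix `A`, and the choice of which entry is left unobserved are hypothetical. It verifies the isometry in expectation exactly by averaging over the completion basis, and then exhibits a rank-one matrix that vanishes on every observed entry when one entry is missing.

```python
import numpy as np

rng = np.random.default_rng(2)
m1, m2 = 3, 4                                  # hypothetical dimensions
A = rng.normal(size=(m1, m2))

# E <A, X>^2 for X uniform on {e_j e_k^T}: the average of squared entries.
l2_pi_sq = np.mean([A[j, k] ** 2 for j in range(m1) for k in range(m2)])
frob_sq = np.linalg.norm(A, "fro") ** 2
mu = np.sqrt(m1 * m2)

# Hypothetical sample that misses the last entry (fewer than m1*m2 distinct entries).
observed = [(j, k) for j in range(m1) for k in range(m2)][:-1]
B = np.zeros((m1, m2))
B[m1 - 1, m2 - 1] = 1.0                        # rank-one matrix supported on the missing entry
sampled_values = [B[j, k] for (j, k) in observed]
```

The matrix `B` is nonzero yet invisible to the sampling operator, which is why a restricted isometry "in probability" cannot hold, while the isometry in expectation still does.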

One can also consider more general matrix measurement models in which, for a given orthonormal basis in the space of matrices, a random sample of Fourier coefficients of the target matrix $A_0$ is observed subject to a random noise. For more discussion on matrix completion with other types of sampling, see [9, 11, 12, 16, 18] and references therein.

Example 2. Column masks. Assume that the design matrices $X_i$ are i.i.d. replications of a random matrix $X$, which has only one nonzero column. For instance, let the distribution of $X$ be such that all the columns have equal probability to be non-zero, and the random entries of the non-zero column $x_{(j)}$ are such that $\mathbb{E}(x_{(j)}x_{(j)}^T)$ is the identity matrix. Then $\|A\|_{L_2(\Pi)}^2=\|A\|_2^2/m_2$, $\forall A\in\mathbb{R}^{m_1\times m_2}$, so that condition (1.4) is satisfied with $\mu=\sqrt{m_2}$. More generally, in view of application to multi-task learning, cf. [24], one can be interested in considering non-identically distributed $X_i$. The model can be then reformulated as a longitudinal regression model, with different distributions of $X_i$ corresponding to different tasks.

Example 3. "Complete" subgaussian design. Assume that the design matrices $X_i$ are i.i.d. replications of a random matrix $X$ such that $\langle A,X\rangle$ is a subgaussian random variable for any $A\in\mathbb{R}^{m_1\times m_2}$. This approach has its roots in compressed sensing. The two major examples are given by the matrices $X$ whose entries are either i.i.d. standard Gaussian or Rademacher random variables. In both cases, we have $\|A\|_{L_2(\Pi)}^2=\|A\|_2^2$, $\forall A\in\mathbb{R}^{m_1\times m_2}$, so that condition (1.4) is satisfied with $\mu=1$. The problem of exact reconstruction of $A_0$ under such a design in the non-noisy setting was studied in [10, 19, 23], whereas estimation of $A_0$ in the presence of noise is analyzed in [10, 19, 24], among which [10, 24] treat the high-dimensional case $m_1m_2>n$.

Example 4. Fixed design. Assume that all the $\Pi_i$ are Dirac measures, so that the design matrices $X_i$ are non-random. Then $\|A\|_{L_2(\Pi)}^2=\frac1n\sum_{i=1}^n\langle A,X_i\rangle^2$, and we get the problem of trace regression with fixed design, cf. [24]. In particular, if $m_1=m_2$, and $A$ and $X_i$ are diagonal matrices, the trace regression model (1.2) becomes the usual linear regression model. Accordingly, the rank of $A$ becomes the number of its non-zero diagonal elements. This observation will allow us to deduce, as a consequence of our general argument, an oracle inequality for the usual Lasso in sparse linear regression with fixed design improving [7] in the sense that the inequality is sharp (cf. Theorem 2 and Section 5.4).

The general oracle inequalities that we will prove in Section 2 can be successfully applied to the above examples. The emphasis in this paper will be on the matrix completion problem (Example 1), for which the previously obtained results were suboptimal.

Statistical estimation of low-rank matrices has recently become a very active field with a rapidly growing literature. The most popular methods are based on penalized empirical risk minimization with nuclear-norm penalty [2-4, 6, 8-10, 14, 19, 20, 24]. Estimators with other types of penalization, such as the Schatten-p norm [24], the von Neumann entropy [18], penalization by the rank [8, 13] or some combined penalties [14] are also discussed.

It is worth pointing out that in many applications, such as in matrix completion, the distribution $\Pi$ is known, and yet this information has not been exploited since the penalized estimation procedures considered in the literature involve the empirical risk $n^{-1}\sum_{i=1}^n(Y_i-\mathrm{tr}(X_i^TA))^2$. In this paper we incorporate the knowledge of $\Pi$ in the construction and we study the following estimator of $A_0$:

\[
\hat A^\lambda=\arg\min_{A\in\mathbb{A}}L_n(A), \tag{1.5}
\]

where $\mathbb{A}\subset\mathbb{R}^{m_1\times m_2}$ is a set of matrices,

\[
L_n(A)=\|A\|_{L_2(\Pi)}^2-\Big\langle\frac2n\sum_{i=1}^nY_iX_i,\ A\Big\rangle+\lambda\|A\|_1, \tag{1.6}
\]

$\lambda>0$ is a regularization parameter, and $\|A\|_1$ is the nuclear norm of $A$. We will mainly consider convex sets $\mathbb{A}$. Note that if all $X_i$ are non-random, $\hat A^\lambda$ coincides with the usual matrix Lasso estimator:

\[
\hat A^\lambda=\arg\min_{A\in\mathbb{A}}\Big\{n^{-1}\sum_{i=1}^n(Y_i-\langle A,X_i\rangle)^2+\lambda\|A\|_1\Big\}. \tag{1.7}
\]

The emphasis in this paper is on the noisy matrix completion setting. Then the estimator $\hat A^\lambda$ has a particularly simple form; it is obtained from the matrix $\frac{m_1m_2}{n}\sum_{i=1}^nY_iX_i$ by soft thresholding of its singular values. One of the main results of this paper is to show that our estimators are rate optimal (up to logarithmic factors) under the Frobenius error for a simple class of matrices $\mathcal{A}(r,a)$ defined by two restrictions: the rank of $A_0$ is not larger than given $r$ and all the entries of $A_0$ are bounded in absolute value by a constant $a$. This rather intuitive class has been first considered in [16]. However, the construction of the estimator in [16] requires the exact knowledge of $\mathrm{rank}(A_0)$ and the upper bound on the Frobenius error obtained in [16] is suboptimal (see the details in Section 3). The recent paper [14] obtains suboptimal bounds of "slow rate" type for matrix completion while [18] focuses on complex-valued Hermitian matrices with nuclear norm equal to 1, which is motivated by density matrix estimation problem in quantum state tomography. These papers do not address the optimality issue. Optimal rates in noisy matrix completion are derived in [24], but on different classes of matrices and with the empirical prediction error rather than with the Frobenius error. Finally, [20] discusses the optimality issue for the Frobenius error on the classes defined in terms of a "spikiness index" of $A_0$, which are not related to $\mathcal{A}(r,a)$, and suggests estimators that require prior knowledge about this index.
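
The soft-thresholding form of the estimator in the matrix completion setting can be sketched as follows. This is a minimal illustration rather than the authors' implementation. It assumes the threshold level $\lambda m_1m_2/2$, which is what minimizing the criterion gives when $\|A\|_{L_2(\Pi)}^2=\|A\|_2^2/(m_1m_2)$ and the minimization is over all $m_1\times m_2$ matrices. The function name `svt_estimator`, the simulated data, the noise level and the value of `lam` are all hypothetical.

```python
import numpy as np

def svt_estimator(rows, cols, y, m1, m2, lam):
    """Soft-threshold the singular values of X = (m1*m2/n) * sum_i Y_i X_i,
    where X_i = e_{rows[i]} e_{cols[i]}^T (assumed threshold: lam*m1*m2/2)."""
    n = len(y)
    X = np.zeros((m1, m2))
    np.add.at(X, (rows, cols), y)          # accumulates repeated (row, col) samples
    X *= m1 * m2 / n
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s_thr = np.maximum(s - lam * m1 * m2 / 2.0, 0.0)
    return (U * s_thr) @ Vt, s_thr

# Hypothetical simulated data: rank-2 target, uniform sampling with replacement.
rng = np.random.default_rng(0)
m1, m2, r, n = 30, 20, 2, 2000
A0 = rng.normal(size=(m1, r)) @ rng.normal(size=(r, m2))
rows = rng.integers(0, m1, size=n)
cols = rng.integers(0, m2, size=n)
y = A0[rows, cols] + 0.1 * rng.normal(size=n)

lam = 0.05                                 # hypothetical regularization level
A_hat, s_thr = svt_estimator(rows, cols, y, m1, m2, lam)
rank_hat = int(np.count_nonzero(s_thr))    # rank of the estimator
```

Only one singular value decomposition is needed, and the rank of the estimator is the number of singular values of the rescaled observation matrix that exceed the threshold.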

The main contributions of this paper are the following. In Section 2 we derive a general oracle inequality for the prediction error $\|\hat A^\lambda-A_0\|_{L_2(\Pi)}$. This oracle inequality is sharp, i.e., with leading constant 1, both in the case of "slow rate" (for matrices $A_0$ with small nuclear norm) and in the case of "fast rate" (for matrices $A_0$ with small rank). As a particular instance of this general result, in Section 3 we obtain an oracle inequality for the matrix completion problem. In Section 4, we establish minimax lower bounds showing that the rates for matrix completion obtained in Section 3 are optimal up to a logarithmic factor. In Section 5, we briefly discuss some other implications and extensions of our method. Finally, Section 6 is devoted to the control of the stochastic term appearing in the proof of the upper bound.

2. General oracle inequalities. We recall first some basic facts about matrices. Let $A\in\mathbb{R}^{m_1\times m_2}$ be a rectangular matrix, and let $r=\mathrm{rank}(A)\le\min(m_1,m_2)$ denote its rank. The singular value decomposition (SVD) of $A$ has the form: $A=\sum_{j=1}^r\sigma_j(A)u_jv_j^T$ with orthonormal vectors $u_1,\dots,u_r\in\mathbb{R}^{m_1}$, orthonormal vectors $v_1,\dots,v_r\in\mathbb{R}^{m_2}$ and real numbers $\sigma_1(A)\ge\cdots\ge\sigma_r(A)>0$ (the singular values of $A$). The pair of linear vector spaces $(S_1,S_2)$ where $S_1$ is the linear span of $\{u_1,\dots,u_r\}$ and $S_2$ is the linear span of $\{v_1,\dots,v_r\}$ will be called the support of $A$. We will denote by $S_j^\perp$ the orthogonal complements of $S_j$, $j=1,2$, and by $P_S$ the projector on the linear vector subspace $S$ of $\mathbb{R}^{m_j}$, $j=1,2$.

The Schatten-p (quasi-)norm $\|A\|_p$ of matrix $A$ is defined by

\[
\|A\|_p=\Big(\sum_{j=1}^{r}\sigma_j(A)^p\Big)^{1/p}\ \text{ for } 0<p<\infty,\quad\text{and}\quad\|A\|_\infty=\sigma_1(A).
\]

Recall the well-known trace duality property:

\[
|\mathrm{tr}(A^TB)|\ \le\ \|A\|_1\|B\|_\infty,\quad\forall A,B\in\mathbb{R}^{m_1\times m_2}.
\]
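
The Schatten norms and the trace duality property are easy to verify numerically. The sketch below is illustrative only; the helper name `schatten` and the random matrices are hypothetical. It also shows that the bound is attained for $B=\sum_ju_jv_j^T$, built from the singular vectors of $A$, which has operator norm 1.

```python
import numpy as np

def schatten(A, p):
    """Schatten-p norm computed from the singular values (hypothetical helper)."""
    s = np.linalg.svd(A, compute_uv=False)
    return s.max() if np.isinf(p) else (s ** p).sum() ** (1.0 / p)

rng = np.random.default_rng(3)
A = rng.normal(size=(4, 6))
B = rng.normal(size=(4, 6))

# Trace duality: |tr(A^T B)| <= ||A||_1 * ||B||_inf
lhs = abs(np.trace(A.T @ B))
rhs = schatten(A, 1) * schatten(B, np.inf)

# Equality case: W = sum_j u_j v_j^T has operator norm 1 and tr(A^T W) = ||A||_1
U, s, Vt = np.linalg.svd(A, full_matrices=False)
W = U @ Vt
attained = np.trace(A.T @ W)
```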
We will also use the fact that the subdifferential of the convex function $A\mapsto\|A\|_1$ is the following set of matrices:

\[
\partial\|A\|_1=\Big\{\sum_{j=1}^ru_jv_j^T+P_{S_1^\perp}WP_{S_2^\perp}:\ \|W\|_\infty\le 1\Big\} \tag{2.1}
\]

(cf. [28]). Define the random matrix

\[
M=\frac1n\sum_{i=1}^n\big(Y_iX_i-\mathbb{E}(Y_iX_i)\big). \tag{2.2}
\]

We will need the following assumption on the distribution of the matrices $X_i$.

Assumption 1. There exists a constant $\mu>0$ such that, for all matrices $A\in\mathbb{A}-\mathbb{A}:=\{A_1-A_2:\ A_1,A_2\in\mathbb{A}\}$,

\[
\|A\|_{L_2(\Pi)}^2\ \ge\ \mu^{-2}\|A\|_2^2.
\]

As discussed in the Introduction, Assumption 1 is satisfied, often with equality and for $\mathbb{A}=\mathbb{A}-\mathbb{A}=\mathbb{R}^{m_1\times m_2}$, in several interesting examples. The next theorem plays the key role in what follows.

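Theorem 1. Let $\mathbb{A}\subset\mathbb{R}^{m_1\times m_2}$ be any set of matrices. If $\lambda\ge 2\|M\|_\infty$, then

\[
\|\hat A^\lambda-A_0\|_{L_2(\Pi)}^2\ \le\ \inf_{A\in\mathbb{A}}\Big[\|A-A_0\|_{L_2(\Pi)}^2+2\lambda\|A\|_1\Big]. \tag{2.3}
\]

If, in addition, $\mathbb{A}$ is a convex set and Assumption 1 is satisfied, then

\[
\|\hat A^\lambda-A_0\|_{L_2(\Pi)}^2\ \le\ \inf_{A\in\mathbb{A}}\Big[\|A-A_0\|_{L_2(\Pi)}^2+\Big(\frac{1+\sqrt2}{2}\Big)^2\mu^2\lambda^2\,\mathrm{rank}(A)\Big]. \tag{2.4}
\]

Furthermore, in this case for all $A\in\mathbb{A}$ with support $(S_1,S_2)$,

\[
\|\hat A^\lambda-A_0\|_{L_2(\Pi)}^2+(\lambda-2\|M\|_\infty)\|P_{S_1^\perp}\hat A^\lambda P_{S_2^\perp}\|_1\ \le\ \|A-A_0\|_{L_2(\Pi)}^2+\Big(\frac{1+\sqrt2}{2}\Big)^2\mu^2\lambda^2\,\mathrm{rank}(A). \tag{2.5}
\]

Proof. It follows from the definition of the estimator $\hat A^\lambda$ that, for all $A\in\mathbb{A}$,

\[
L_n(\hat A^\lambda)=\|\hat A^\lambda\|_{L_2(\Pi)}^2-\Big\langle\frac2n\sum_{i=1}^nY_iX_i,\ \hat A^\lambda\Big\rangle+\lambda\|\hat A^\lambda\|_1\ \le\ \|A\|_{L_2(\Pi)}^2-\Big\langle\frac2n\sum_{i=1}^nY_iX_i,\ A\Big\rangle+\lambda\|A\|_1=L_n(A).
\]

Also, note that

\[
\frac1n\sum_{i=1}^n\mathbb{E}(Y_iX_i)=\frac1n\sum_{i=1}^n\mathbb{E}\big(\langle A_0,X_i\rangle X_i\big)\quad\text{and}\quad\frac1n\sum_{i=1}^n\langle\mathbb{E}(Y_iX_i),A\rangle=\langle A_0,A\rangle_{L_2(\Pi)}.
\]

Therefore, we have

\[
\|\hat A^\lambda\|_{L_2(\Pi)}^2-2\langle\hat A^\lambda,A_0\rangle_{L_2(\Pi)}\ \le\ \|A\|_{L_2(\Pi)}^2-2\langle A,A_0\rangle_{L_2(\Pi)}+\Big\langle\frac2n\sum_{i=1}^n\big(Y_iX_i-\mathbb{E}(Y_iX_i)\big),\ \hat A^\lambda-A\Big\rangle+\lambda\big(\|A\|_1-\|\hat A^\lambda\|_1\big),
\]

which implies, due to the trace duality,

\[
\|\hat A^\lambda-A_0\|_{L_2(\Pi)}^2\ \le\ \|A-A_0\|_{L_2(\Pi)}^2+2\Delta\|\hat A^\lambda-A\|_1+\lambda\big(\|A\|_1-\|\hat A^\lambda\|_1\big),
\]

where we set for brevity $\Delta=\|M\|_\infty$. Under the assumption $\lambda\ge 2\Delta$ this yields

\[
\|\hat A^\lambda-A_0\|_{L_2(\Pi)}^2\ \le\ \|A-A_0\|_{L_2(\Pi)}^2+\lambda\big(\|\hat A^\lambda-A\|_1+\|A\|_1-\|\hat A^\lambda\|_1\big)\ \le\ \|A-A_0\|_{L_2(\Pi)}^2+2\lambda\|A\|_1,
\]

and the bound (2.3) follows.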

To prove the remaining bounds, note that a necessary condition of extremum in problem (1.5) implies that there exists $\hat V\in\partial\|\hat A^\lambda\|_1$ such that, for all $A\in\mathbb{A}$,

\[
2\langle\hat A^\lambda,\hat A^\lambda-A\rangle_{L_2(\Pi)}-\Big\langle\frac2n\sum_{i=1}^nY_iX_i,\ \hat A^\lambda-A\Big\rangle+\lambda\langle\hat V,\hat A^\lambda-A\rangle\ \le\ 0. \tag{2.6}
\]

Indeed, since $\hat A^\lambda$ is a minimizer of $L_n(A)$ in $\mathbb{A}$, there exists a matrix $B\in\partial L_n(\hat A^\lambda)$ such that $-B$ belongs to the normal cone of $\mathbb{A}$ at the point $\hat A^\lambda$ (cf. [5], Chapter 4, Section 2, Corollary 6). It is easy to see that $B$ can be represented as follows

\[
B=2\int_{\mathbb{R}^{m_1\times m_2}}\langle\hat A^\lambda,X\rangle X\,\Pi(dX)-\frac2n\sum_{i=1}^nY_iX_i+\lambda\hat V,
\]

where $\hat V\in\partial\|\hat A^\lambda\|_1$. The condition that $-B$ belongs to the normal cone at the point $\hat A^\lambda$ implies that $\langle B,\hat A^\lambda-A\rangle\le 0$, and (2.6) follows.
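Consider an arbitrary $A \in \mathbb{A}$ of rank $r$ with spectral representation $A = \sum_{j=1}^r \sigma_j u_j v_j^T$ and with support $(S_1, S_2)$. It follows from (2.6) that

$$2\langle \hat A^\lambda - A_0, \hat A^\lambda - A\rangle_{L_2(\Pi)} + \lambda\langle \hat V - V, \hat A^\lambda - A\rangle \le -\lambda\langle V, \hat A^\lambda - A\rangle + 2\langle M, \hat A^\lambda - A\rangle \tag{2.7}$$

for an arbitrary $V \in \partial\|A\|_1$. By monotonicity of subdifferentials of convex functions, $\langle \hat V - V, \hat A^\lambda - A\rangle \ge 0$. On the other hand, by (2.1), the following representation holds:

$$V = \sum_{j=1}^r u_j v_j^T + P_{S_1^\perp} W P_{S_2^\perp},$$

where $W$ is an arbitrary matrix with $\|W\|_\infty \le 1$. It follows from the trace duality that there exists $W$ with $\|W\|_\infty \le 1$ such that

$$\langle P_{S_1^\perp} W P_{S_2^\perp}, \hat A^\lambda - A\rangle = \langle P_{S_1^\perp} W P_{S_2^\perp}, \hat A^\lambda\rangle = \langle W, P_{S_1^\perp} \hat A^\lambda P_{S_2^\perp}\rangle = \|P_{S_1^\perp} \hat A^\lambda P_{S_2^\perp}\|_1,$$

where in the first equality we used that $A$ has the support $(S_1, S_2)$. For this particular choice of $W$, (2.7) implies that

$$2\langle \hat A^\lambda - A_0, \hat A^\lambda - A\rangle_{L_2(\Pi)} + \lambda\|P_{S_1^\perp} \hat A^\lambda P_{S_2^\perp}\|_1 \le -\lambda\Big\langle \sum_{j=1}^r u_j v_j^T, \hat A^\lambda - A\Big\rangle + 2\langle M, \hat A^\lambda - A\rangle. \tag{2.8}$$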
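Using the identity

$$2\langle \hat A^\lambda - A_0, \hat A^\lambda - A\rangle_{L_2(\Pi)} = \|\hat A^\lambda - A_0\|_{L_2(\Pi)}^2 + \|\hat A^\lambda - A\|_{L_2(\Pi)}^2 - \|A - A_0\|_{L_2(\Pi)}^2 \tag{2.9}$$

and the facts that

$$\Big\|\sum_{j=1}^r u_j v_j^T\Big\|_\infty = 1, \qquad \Big\langle \sum_{j=1}^r u_j v_j^T, \hat A^\lambda - A\Big\rangle = \Big\langle \sum_{j=1}^r u_j v_j^T, P_{S_1}(\hat A^\lambda - A)P_{S_2}\Big\rangle, \tag{2.10}$$

we deduce from (2.8) that

$$\|\hat A^\lambda - A_0\|_{L_2(\Pi)}^2 + \|\hat A^\lambda - A\|_{L_2(\Pi)}^2 + \lambda\|P_{S_1^\perp} \hat A^\lambda P_{S_2^\perp}\|_1 \le \|A - A_0\|_{L_2(\Pi)}^2 + \lambda\|P_{S_1}(\hat A^\lambda - A)P_{S_2}\|_1 + 2\langle M, \hat A^\lambda - A\rangle. \tag{2.11}$$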

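To provide an upper bound on $2\langle M, \hat A^\lambda - A\rangle$ we use the following decomposition:

$$\langle M, \hat A^\lambda - A\rangle = \langle \mathcal{P}_A(M), \hat A^\lambda - A\rangle + \langle P_{S_1^\perp} M P_{S_2^\perp}, \hat A^\lambda - A\rangle = \langle \mathcal{P}_A(M), \mathcal{P}_A(\hat A^\lambda - A)\rangle + \langle P_{S_1^\perp} M P_{S_2^\perp}, \hat A^\lambda\rangle,$$

where $\mathcal{P}_A(M) = M - P_{S_1^\perp} M P_{S_2^\perp}$. This implies, due to the trace duality,

$$2|\langle M, \hat A^\lambda - A\rangle| \le \Lambda\|\mathcal{P}_A(\hat A^\lambda - A)\|_2 + \Gamma\|P_{S_1^\perp} \hat A^\lambda P_{S_2^\perp}\|_1 \le \Lambda\|\hat A^\lambda - A\|_2 + \Gamma\|P_{S_1^\perp} \hat A^\lambda P_{S_2^\perp}\|_1, \tag{2.12}$$

where

$$\Lambda = 2\|\mathcal{P}_A(M)\|_2, \qquad \Gamma = 2\|P_{S_1^\perp} M P_{S_2^\perp}\|_\infty. \tag{2.13}$$

Note that

$$\Gamma \le 2\|M\|_\infty = 2\Delta. \tag{2.14}$$

Since

$$\mathcal{P}_A(M) = P_{S_1^\perp} M P_{S_2} + P_{S_1} M \tag{2.15}$$

and $\mathrm{rank}(P_{S_j}) \le \mathrm{rank}(A)$, $j = 1, 2$, we have

$$\Lambda \le 2\sqrt{\mathrm{rank}(\mathcal{P}_A(M))}\,\|M\|_\infty \le 2\sqrt{2\,\mathrm{rank}(A)}\,\Delta \le \sqrt{2\,\mathrm{rank}(A)}\,\lambda.$$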
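Due to the fact that

$$\|P_{S_1}(\hat A^\lambda - A)P_{S_2}\|_1 \le \sqrt{\mathrm{rank}(A)}\,\|P_{S_1}(\hat A^\lambda - A)P_{S_2}\|_2 \le \sqrt{\mathrm{rank}(A)}\,\|\hat A^\lambda - A\|_2 \tag{2.16}$$

and to Assumption 1, it follows from (2.11) and (2.12) that

$$\|\hat A^\lambda - A_0\|_{L_2(\Pi)}^2 + \|\hat A^\lambda - A\|_{L_2(\Pi)}^2 + \lambda\|P_{S_1^\perp} \hat A^\lambda P_{S_2^\perp}\|_1 \le \|A - A_0\|_{L_2(\Pi)}^2 + \mu\big(\lambda\sqrt{\mathrm{rank}(A)} + \Lambda\big)\|\hat A^\lambda - A\|_{L_2(\Pi)} + \Gamma\|P_{S_1^\perp} \hat A^\lambda P_{S_2^\perp}\|_1. \tag{2.17}$$

Using the above bounds on $\Lambda$ and $\Gamma$, we obtain from (2.17) that

$$\|\hat A^\lambda - A_0\|_{L_2(\Pi)}^2 + \|\hat A^\lambda - A\|_{L_2(\Pi)}^2 + (\lambda - 2\Delta)\|P_{S_1^\perp} \hat A^\lambda P_{S_2^\perp}\|_1 \le \|A - A_0\|_{L_2(\Pi)}^2 + (1 + \sqrt{2})\,\mu\lambda\sqrt{\mathrm{rank}(A)}\,\|\hat A^\lambda - A\|_{L_2(\Pi)},$$

which implies, since $bx - x^2 \le b^2/4$ for all real $x$,

$$\|\hat A^\lambda - A_0\|_{L_2(\Pi)}^2 + (\lambda - 2\Delta)\|P_{S_1^\perp} \hat A^\lambda P_{S_2^\perp}\|_1 \le \|A - A_0\|_{L_2(\Pi)}^2 + \Big(\frac{1 + \sqrt{2}}{2}\Big)^2 \mu^2\lambda^2\,\mathrm{rank}(A).$$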
The following immediate corollary of Theorem 1 provides a bound for the Frobenius error.

Corollary 1. Let $\mathbb{A}$ be a convex subset of $m_1 \times m_2$ matrices containing $A_0$, and let Assumption 1 be satisfied. If $\lambda \ge 2\|M\|_\infty$, then

$$\|\hat A^\lambda - A_0\|_2^2 \le \lambda\mu^2 \min\Big\{2\|A_0\|_1,\ \Big(\frac{1 + \sqrt{2}}{2}\Big)^2 \lambda\mu^2\,\mathrm{rank}(A_0)\Big\}. \tag{2.18}$$

Next, we consider a version of Theorem 1 under weaker assumptions which are akin to the Restricted Eigenvalue condition in sparse estimation of vectors. For simplicity, we will do it only when the domain $\mathbb{A}$ of minimization in (1.5) is a linear subspace of $\mathbb{R}^{m_1\times m_2}$. Recall that, given $A \in \mathbb{A}$ with support $(S_1, S_2)$, we denote

$$\mathcal{P}_A(B) := B - P_{S_1^\perp} B P_{S_2^\perp}, \qquad \mathcal{P}_A^\perp(B) := P_{S_1^\perp} B P_{S_2^\perp}, \qquad B \in \mathbb{R}^{m_1\times m_2},$$

and, for $c_0 \ge 0$, define the following cone of matrices:

$$\mathcal{C}_{A,c_0} := \big\{B \in \mathbb{A} : \|\mathcal{P}_A^\perp(B)\|_1 \le c_0\|\mathcal{P}_A(B)\|_1\big\}.$$

Finally, define

$$\mu_{c_0}(A) := \inf\big\{\mu' > 0 : \|\mathcal{P}_A(B)\|_2 \le \mu'\|B\|_{L_2(\Pi)},\ B \in \mathcal{C}_{A,c_0}\big\}.$$

Note that $\mu_{c_0}(A)$ is a nondecreasing function of $c_0$. For $c_0 = +\infty$, the quantity $\mu_\infty(A)$ has a simple meaning: it is equal to the norm of the linear transformation $B \mapsto \mathcal{P}_A(B)$ from the space $\mathbb{A}$ equipped with the $L_2(\Pi)$-norm into the space of all matrices equipped with the Frobenius norm. For $c_0 = 0$, $\mu_0(A)$ is the norm of the same linear transformation restricted to the subspace of $\mathbb{A}$ consisting of all matrices $B \in \mathbb{A}$ with $\mathcal{P}_A^\perp(B) = 0$. We are more interested in the intermediate values, $c_0 \in (0, +\infty)$. In this case, $\mu_{c_0}(A)$ is the "norm" of the linear mapping $\mathcal{P}_A$ restricted to the cone of matrices $B$ for which $\mathcal{P}_A(B)$ is the dominant part and $\mathcal{P}_A^\perp(B)$ is "small". Note that the rank of $\mathcal{P}_A(B)$ is not larger than $2\,\mathrm{rank}(A)$, so, when the rank of $A$ is small, the matrices in $\mathcal{C}_{A,c_0}$ are approximately "low-rank". Quantities of the same flavor have been previously used in the literature on the Lasso, the Dantzig selector and other methods of sparse estimation of vectors. In these problems, they can be expressed in terms of "restricted eigenvalues" of certain Gram matrices, cf. the Restricted Eigenvalue condition in [7] for the fixed design case and similar distribution dependent conditions in [17] for the random design case. Such conditions are also considered in [21] for the matrix case. In what follows, we use the value $c_0 = 5$ and set $\mu(A) := \mu_5(A)$.
Theorem 2. Let $\mathbb{A}$ be a linear subspace of $\mathbb{R}^{m_1\times m_2}$. If $\lambda \ge 3\|M\|_\infty$, then

$$\|\hat A^\lambda - A_0\|_{L_2(\Pi)}^2 \le \inf_{A \in \mathbb{A}} \Big[\|A - A_0\|_{L_2(\Pi)}^2 + \lambda^2\mu^2(A)\,\mathrm{rank}(A)\Big]. \tag{2.19}$$

Proof. Fix $A \in \mathbb{A}$ with support $(S_1, S_2)$. If $\langle \hat A^\lambda - A_0, \hat A^\lambda - A\rangle_{L_2(\Pi)} \le 0$, then we trivially have $\|\hat A^\lambda - A_0\|_{L_2(\Pi)}^2 \le \|A - A_0\|_{L_2(\Pi)}^2$ in view of (2.9). Thus, assume that $\langle \hat A^\lambda - A_0, \hat A^\lambda - A\rangle_{L_2(\Pi)} > 0$. In this case, (2.8) and an obvious modification of (2.10) imply

$$\lambda\|P_{S_1^\perp} \hat A^\lambda P_{S_2^\perp}\|_1 \le \lambda\|\mathcal{P}_A(\hat A^\lambda - A)\|_1 + 2\langle M, \hat A^\lambda - A\rangle. \tag{2.20}$$

Now,

$$\langle M, \hat A^\lambda - A\rangle = \langle M, \mathcal{P}_A(\hat A^\lambda - A)\rangle + \langle M, \mathcal{P}_A^\perp(\hat A^\lambda - A)\rangle \le \|M\|_\infty\big(\|\mathcal{P}_A(\hat A^\lambda - A)\|_1 + \|\mathcal{P}_A^\perp(\hat A^\lambda - A)\|_1\big). \tag{2.21}$$

By (2.20) and (2.21),

$$(\lambda - 2\Delta)\|\mathcal{P}_A^\perp(\hat A^\lambda - A)\|_1 \le (\lambda + 2\Delta)\|\mathcal{P}_A(\hat A^\lambda - A)\|_1. \tag{2.22}$$

For $\lambda \ge 3\Delta$, this yields

$$\|\mathcal{P}_A^\perp(\hat A^\lambda - A)\|_1 \le 5\|\mathcal{P}_A(\hat A^\lambda - A)\|_1,$$

which implies that $\hat A^\lambda - A \in \mathcal{C}_{A,5}$, and thus $\|\mathcal{P}_A(\hat A^\lambda - A)\|_2 \le \mu(A)\|\hat A^\lambda - A\|_{L_2(\Pi)}$. Combining this inequality with (2.11), (2.12), (2.13), (2.14) and using that $\lambda \ge 3\Delta$, after some algebra we get

$$\|\hat A^\lambda - A_0\|_{L_2(\Pi)}^2 + \|\hat A^\lambda - A\|_{L_2(\Pi)}^2 + (\lambda/3)\|P_{S_1^\perp} \hat A^\lambda P_{S_2^\perp}\|_1 \le \|A - A_0\|_{L_2(\Pi)}^2 + (1 + 2\sqrt{2}/3)\,\mu(A)\lambda\sqrt{\mathrm{rank}(A)}\,\|\hat A^\lambda - A\|_{L_2(\Pi)} \le \|A - A_0\|_{L_2(\Pi)}^2 + \|\hat A^\lambda - A\|_{L_2(\Pi)}^2 + \mu^2(A)\lambda^2\,\mathrm{rank}(A).$$

□

As a simple example, consider the case when $m_1 = m_2$, $\mathbb{A}$ is the space of all diagonal matrices, and $X_i$ also belong to $\mathbb{A}$. Then the trace regression model (1.2) becomes the usual linear regression model. The Schatten $p$-norms are in this case equivalent to the $\ell_p$-norms, with the operator norm $\|\cdot\|_\infty$ being the $\ell_\infty$-norm and the rank of the matrix $A$ characterizing the sparsity of the corresponding vector. The problem of minimizing the functional $L_n(A)$ over the space $\mathbb{A}$ is a Lasso-type penalized empirical risk minimization. In particular, it coincides with the standard Lasso if all $X_i$ are non-random. Inequalities of Theorem 1 and (2.19) become, in this case, sparsity oracle inequalities for the Lasso-type estimators. It is noteworthy that these inequalities are sharp (i.e., with leading constant 1), which was not achieved in the past work. The random matrix $M$ is also diagonal and its norm $\|M\|_\infty$ is just the $\ell_\infty$-norm of the corresponding random vector, which is the sum of independent random vectors. Hence, it is easy to provide probabilistic bounds on $\|M\|_\infty$ using, for instance, the classical Bernstein inequality and the union bound. We give an example of such an application of Theorem 2 in Section 5.4.
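The correspondence between Schatten norms of diagonal matrices and vector norms described above can be checked numerically. The following is a minimal sketch, assuming NumPy; the helper name `schatten_summary` is hypothetical and introduced only for illustration.

```python
import numpy as np

def schatten_summary(beta):
    """Nuclear norm, operator norm and rank of diag(beta).

    For a diagonal matrix the singular values are |beta_j|, so these
    quantities should match the l1 norm, the l-infinity norm and the
    number of nonzero coordinates of beta.
    """
    A = np.diag(np.asarray(beta, dtype=float))
    s = np.linalg.svd(A, compute_uv=False)   # singular values of diag(beta)
    return s.sum(), s.max(), int(np.linalg.matrix_rank(A))

def trace_inner_product(x, beta):
    """<diag(x), diag(beta)> = tr(diag(x)^T diag(beta)), which equals x . beta."""
    return float(np.trace(np.diag(x).T @ np.diag(beta)))
```

For instance, with `beta = [3.0, 0.0, -1.5, 0.0]` the three returned values correspond to $|\beta|_1$, $|\beta|_\infty$ and the sparsity of $\beta$, and `trace_inner_product(x, beta)` reduces to the usual linear predictor.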

3. Upper bounds for matrix completion. In this section we consider implications of the general oracle inequalities of Theorem 1 for the model of USR matrix completion. Thus, we assume that the matrices $X_i$ are i.i.d. uniformly distributed in the matrix completion basis $\mathcal{X}$, which implies that $\|A\|_{L_2(\Pi)}^2 = (m_1 m_2)^{-1}\|A\|_2^2$ for all matrices $A \in \mathbb{R}^{m_1\times m_2}$, and we set $\mu = \sqrt{m_1 m_2}$. The estimator $\hat A^\lambda$ is then defined by (here and further on we set $\mathbb{A} = \mathbb{R}^{m_1\times m_2}$ in the case of matrix completion):

$$\hat A^\lambda = \mathop{\mathrm{argmin}}_{A \in \mathbb{R}^{m_1\times m_2}} \Big\{\frac{\|A\|_2^2}{m_1 m_2} - \Big\langle \frac{2}{n}\sum_{i=1}^n Y_i X_i, A\Big\rangle + \lambda\|A\|_1\Big\} = \mathop{\mathrm{argmin}}_{A \in \mathbb{R}^{m_1\times m_2}} \big\{\|A - X\|_2^2 + \lambda m_1 m_2\|A\|_1\big\}, \tag{3.1}$$

where

$$X = \frac{m_1 m_2}{n}\sum_{i=1}^n Y_i X_i.$$

We can also write $\hat A^\lambda$ explicitly:

$$\hat A^\lambda = \sum_j \big(\sigma_j(X) - \lambda m_1 m_2/2\big)_+\, u_j(X) v_j(X)^T, \tag{3.2}$$

where $x_+ = \max\{x, 0\}$, $\sigma_j(X)$ are the singular values and $u_j(X)$, $v_j(X)$ are the left and right singular vectors of $X = \sum_{j=1}^{\mathrm{rank}(X)} \sigma_j(X) u_j(X) v_j(X)^T$. Thus, $\hat A^\lambda$ has a particularly simple form; it is obtained by soft thresholding of singular values in the SVD of $X$. To see why (3.2) gives the solution of (3.1), note that, in view of (2.1), the subdifferential of $F(A) = \|A - X\|_2^2 + \lambda m_1 m_2\|A\|_1$ is the set of matrices

$$\partial F(A) = \Big\{2(A - X) + \lambda m_1 m_2\Big(\sum_{j=1}^r u_j v_j^T + P_{S_1^\perp} W P_{S_2^\perp}\Big) : \|W\|_\infty \le 1\Big\},$$






where $r$, $u_j$, $v_j$, $S_1$, $S_2$ correspond to the SVD of $A$. Since $A \mapsto F(A)$ is strictly convex, the minimizer $\hat A^\lambda$ is unique, and the condition $0 \in \partial F(\hat A^\lambda)$ is a necessary and sufficient characterization of the minimum, where $0$ is the zero $m_1 \times m_2$ matrix. Considering

$$W = \sum_{j:\ \sigma_j(X) \le \lambda m_1 m_2/2} \frac{2\sigma_j(X)}{\lambda m_1 m_2}\, u_j(X) v_j(X)^T,$$

it is easy to check that (3.2) satisfies this condition.

We will see that the soft thresholding representation (3.2) helps to understand in an easy way some theoretical properties of $\hat A^\lambda$. However, it may not always be preferable for computational purposes. Indeed, the standard techniques of computation of the SVD can become numerically unstable when the dimension is high. On the other hand, we can always compute $\hat A^\lambda$ from (3.1) using methods of convex programming that are free from this drawback.
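The closed form (3.2) can be sketched in a few lines. This is an illustrative sketch only, assuming NumPy and a dense SVD; the function names `build_X` and `soft_threshold_estimator` are hypothetical, and the observations are assumed to be given as (row, column) positions of the sampled entries together with the responses $Y_i$.

```python
import numpy as np

def build_X(entries, values, m1, m2):
    """X = (m1*m2/n) * sum_i Y_i X_i, where X_i = e_j e_k^T marks the sampled entry.

    entries: list of (row, col) index pairs (repeats allowed, as in
             sampling with replacement); values: the responses Y_i.
    """
    n = len(values)
    X = np.zeros((m1, m2))
    for (j, k), y in zip(entries, values):
        X[j, k] += y
    return X * (m1 * m2 / n)

def soft_threshold_estimator(X, lam, m1, m2):
    """Soft-threshold the singular values of X at level lam*m1*m2/2, as in (3.2)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s_thr = np.maximum(s - lam * m1 * m2 / 2.0, 0.0)
    return (U * s_thr) @ Vt  # scales column j of U by s_thr[j]
```

For example, `soft_threshold_estimator(build_X(entries, values, m1, m2), lam, m1, m2)` returns a matrix whose rank equals the number of singular values of $X$ exceeding $\lambda m_1 m_2/2$.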

In view of Theorem 1, to get the oracle inequalities in a closed form it remains only to specify the value of the regularization parameter $\lambda$ such that $\lambda \ge 2\|M\|_\infty$ with high probability. This requires some assumptions on the distribution of $(X_i, Y_i)$, and the value of $\lambda$ will be different under different assumptions. We will consider only the following two cases of particular interest.

• Sub-exponential noise and matrices with uniformly bounded entries. There exist constants $\sigma, c_1 > 0$, $\alpha \ge 1$ and $\tilde c$ such that

$$\max_{i=1,\dots,n} E\exp\Big(\frac{|\xi_i|^\alpha}{\sigma^\alpha}\Big) < \tilde c, \qquad E\xi_i^2 \ge c_1\sigma^2, \quad \forall\, 1 \le i \le n, \tag{3.3}$$

and $\max_{i,j}|a_0(i,j)| \le a$ for some constant $a$.

• Statistical learning setting. There exists a constant $\eta$ such that $\max_{i=1,\dots,n}|Y_i| \le \eta$ almost surely.

In both cases, we obtain the upper bounds for $\|M\|_\infty$ (that we call the stochastic error) using the non-commutative Bernstein inequalities, cf. Section 6. The resulting values of $\lambda$ and the corresponding oracle inequalities are given in the next two theorems.

Set $m = m_1 + m_2$. In what follows, we will denote by $C$ absolute positive constants, possibly different on different occasions.

Theorem 3. Let $X_i$ be i.i.d. uniformly distributed on $\mathcal{X}$, and the pairs $(X_i, Y_i)$ be i.i.d. Assume that $\max_{i,j}|a_0(i,j)| \le a$ for some constant $a$, and that condition (3.3) holds. For $t > 0$, consider the regularization parameter $\lambda$ satisfying

$$\lambda \ge C(\sigma \vee a)\max\Big\{\sqrt{\frac{t + \log(m)}{(m_1 \wedge m_2)\,n}},\ \frac{(t + \log(m))\log^{1/\alpha}(m_1 \wedge m_2)}{n}\Big\}, \tag{3.4}$$

where $C > 0$ is a large enough constant that can depend only on $\alpha, c_1, \tilde c$. Then with probability at least $1 - 3e^{-t}$ we have

$$\|\hat A^\lambda - A_0\|_2^2 \le \|A - A_0\|_2^2 + m_1 m_2 \min\Big\{2\lambda\|A\|_1,\ \Big(\frac{1 + \sqrt{2}}{2}\Big)^2 m_1 m_2 \lambda^2\,\mathrm{rank}(A)\Big\} \tag{3.5}$$

for all $A \in \mathbb{R}^{m_1\times m_2}$.

Theorem 4. Let $X_i$ be i.i.d. uniformly distributed on $\mathcal{X}$. Assume that $\max_{i=1,\dots,n}|Y_i| \le \eta$ almost surely for some constant $\eta$. For $t > 0$ consider the regularization parameter $\lambda$ satisfying

$$\lambda \ge 4\eta\max\Big\{\sqrt{\frac{t + \log(m)}{(m_1 \wedge m_2)\,n}},\ \frac{2(t + \log(m))}{n}\Big\}. \tag{3.6}$$

Then with probability at least $1 - e^{-t}$ inequality (3.5) holds for all $A \in \mathbb{R}^{m_1\times m_2}$.

Theorems 3 and 4 follow immediately from Theorem 1 and Lemmas 1, 2 and 3 with $\mu = \sqrt{m_1 m_2}$.

Note that the natural choice of $t$ in Theorems 3 and 4 is of the order $\log(m)$, since a larger $t$ leads to a slower rate of convergence and a smaller $t$ does not improve the rate but makes the concentration probability smaller. Note also that, under this choice of $t$, the second terms under the maxima in (3.4) and (3.6) are negligible for the values of $n, m_1, m_2$ such that the term containing $\mathrm{rank}(A_0)$ in (3.5) is meaningful. Indeed, if $t$ is of the order $\log(m)$, the condition that $m_1 m_2 \lambda^2 \ll 1$ necessarily implies $n \gg (m_1 \vee m_2)\log(m)$. On the other hand, the negligibility of the second terms under the maxima in (3.4) and (3.6) is approximately equivalent to $n > (m_1 \wedge m_2)\log^{1+2/\alpha}(m)$ and $n > (m_1 \wedge m_2)\log(m)$ respectively. Based on these remarks, we can choose $\lambda$ in the form

$$\lambda = C_* c_* \sqrt{\frac{\log(m)}{(m_1 \wedge m_2)\,n}}, \tag{3.7}$$

where $c_*$ equals either $\sigma \vee a$ or $\eta$ and the constant $C_* > 0$ is large enough, and we can state the following corollary that will be further useful for minimax considerations. Define $\tau > 0$ by

$$\tau^2 = \Big(\frac{1 + \sqrt{2}}{2}\Big)^2 C_*^2 c_*^2\, \frac{M\log(m)}{n},$$

where $M = \max(m_1, m_2)$, and $m = m_1 + m_2$.

Corollary 2. Let one of the sets of conditions (i) or (ii) below be satisfied:

(i) The assumptions of Theorem 3 with $\lambda$ as in (3.7), $n > (m_1 \wedge m_2)\log^{1+2/\alpha}(m)$, $c_* = \sigma \vee a$, and a large enough constant $C_* > 0$ that can depend only on $\alpha, c_1, \tilde c$.

(ii) The assumptions of Theorem 4 with $n > 4(m_1 \wedge m_2)\log(m)$, $\lambda$ as in (3.7), $c_* = \eta$, and $C_* = 4$.

Then, with probability at least $1 - 3/(m_1 + m_2)$,

$$\frac{1}{m_1 m_2}\|\hat A^\lambda - A_0\|_2^2 \le \min_{A \in \mathbb{R}^{m_1\times m_2}} \Big(\frac{1}{m_1 m_2}\|A - A_0\|_2^2 + \tau^2\,\mathrm{rank}(A)\Big), \tag{3.8}$$

and, in particular,

$$\frac{1}{m_1 m_2}\|\hat A^\lambda - A_0\|_2^2 \le \Big(\frac{1 + \sqrt{2}}{2}\Big)^2 C_*^2 c_*^2 \log(m)\, \frac{M\,\mathrm{rank}(A_0)}{n}, \tag{3.9}$$

where $M = \max(m_1, m_2)$, and $m = m_1 + m_2$. Furthermore, with the same probability,

$$\frac{1}{m_1 m_2}\|\hat A^\lambda - A_0\|_2^2 \le \sum_{j=1}^{m_1 \wedge m_2} \min\Big\{\tau^2,\ \frac{\sigma_j^2(A_0)}{m_1 m_2}\Big\} \le \inf_{0 < q \le 2} \tau^{2-q}\, \frac{\|A_0\|_q^q}{(m_1 m_2)^{q/2}}. \tag{3.10}$$

Proof. Inequalities (3.8) and (3.9) are straightforward in view of Theorems 3 and 4. To prove (3.10) it suffices to note that, for any $\kappa > 0$, $0 < q \le 2$,

$$\min_A \big(\|A - A_0\|_2^2 + \kappa^2\,\mathrm{rank}(A)\big) = \sum_j \min\{\kappa^2, \sigma_j^2(A_0)\} = \kappa^2\sum_j \min\Big\{1, \Big(\frac{\sigma_j(A_0)}{\kappa}\Big)^2\Big\} \le \kappa^2\sum_j \Big(\frac{\sigma_j(A_0)}{\kappa}\Big)^q = \kappa^{2-q}\|A_0\|_q^q.$$
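The identity used in the proof of (3.10) can be verified numerically on small examples. The sketch below assumes NumPy and relies on the Eckart-Young theorem (an assumption not spelled out in the text) to restrict the minimization over rank-$k$ matrices to truncated SVDs of $A_0$; the function names are hypothetical.

```python
import numpy as np

def penalized_rank_min(A0, kappa):
    """min over k of ||A0 - A_k||_2^2 + kappa^2 * k, with A_k the rank-k truncated SVD.

    By the Eckart-Young theorem, A_k is a best rank-k approximation of A0
    in Frobenius norm, so this equals the minimum over all matrices A.
    """
    U, s, Vt = np.linalg.svd(A0, full_matrices=False)
    best = np.inf
    for k in range(len(s) + 1):
        A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]
        val = np.linalg.norm(A0 - A_k, "fro") ** 2 + kappa ** 2 * k
        best = min(best, val)
    return float(best)

def closed_form(A0, kappa):
    """sum_j min(kappa^2, sigma_j(A0)^2)."""
    s = np.linalg.svd(A0, compute_uv=False)
    return float(np.sum(np.minimum(kappa ** 2, s ** 2)))

def schatten_q_bound(A0, kappa, q):
    """kappa^(2-q) * ||A0||_q^q, the upper bound valid for 0 < q <= 2."""
    s = np.linalg.svd(A0, compute_uv=False)
    return float(kappa ** (2 - q) * np.sum(s ** q))
```

The first two functions should agree up to rounding, and `schatten_q_bound` should dominate both for every $q \in (0, 2]$.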
Inequality (3.9) guarantees that the normalized Frobenius error $(m_1 m_2)^{-1}\|\hat A^\lambda - A_0\|_2^2$ of the estimator $\hat A^\lambda$ is small whenever $n > C(m_1 \vee m_2)\log(m)\,\mathrm{rank}(A_0)$ with a large enough $C > 0$. This quantifies the sample size $n$ necessary for successful matrix completion from noisy data.

Note that we can choose $\lambda$ not necessarily equal to, but also greater than, the right hand side of (3.7), or equivalently, $\lambda = t\,C_* c_* \sqrt{\log(m)/((m_1 \wedge m_2)\,n)}$ for any $t \ge 1$. Then the resulting oracle inequalities will remain of the same form with $\tau^2$ multiplied by the constant $t^2$.
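As a small illustration of the choice (3.7), the following sketch computes $\lambda$ and the corresponding $\tau^2$ for given sample size and dimensions. It uses only the Python standard library; the function names are hypothetical, and the constants $C_*$ and $c_*$ must be supplied by the user according to the setting of Corollary 2.

```python
import math

def lambda_37(n, m1, m2, c_star, C_star):
    """Regularization parameter of the form (3.7), with m = m1 + m2."""
    m = m1 + m2
    return C_star * c_star * math.sqrt(math.log(m) / (min(m1, m2) * n))

def tau_squared(n, m1, m2, c_star, C_star):
    """tau^2 = ((1 + sqrt 2)/2)^2 * C_*^2 * c_*^2 * M * log(m) / n, with M = max(m1, m2)."""
    m = m1 + m2
    M = max(m1, m2)
    return ((1 + math.sqrt(2)) / 2) ** 2 * C_star ** 2 * c_star ** 2 * M * math.log(m) / n
```

With these definitions, $\tau^2$ coincides with $((1+\sqrt{2})/2)^2\, m_1 m_2 \lambda^2$, which is the factor multiplying $\mathrm{rank}(A)$ in (3.5) after normalization by $m_1 m_2$.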

Keshavan et al. [16], Theorem 1.1, under a sampling scheme different from ours (sampling without replacement) and sub-gaussian errors, proposed an estimator $\hat A$ satisfying, with probability at least $1 - (m_1 \wedge m_2)^{-3}$,

$$\frac{1}{m_1 m_2}\|\hat A - A_0\|_2^2 \le C\sqrt{\beta}\,\log(n)\, \frac{M\,\mathrm{rank}(A_0)}{n}, \tag{3.11}$$

where $C > 0$ is a constant, and $\beta = (m_1 \vee m_2)/(m_1 \wedge m_2)$ is the aspect ratio. A drawback is that the construction of $\hat A$ in [16] requires the exact knowledge of $\mathrm{rank}(A_0)$ (although it does not seem to require the knowledge of $\sigma$). Furthermore, the bound (3.11) is suboptimal for "very rectangular" matrices, i.e., when $\beta \gg 1$. Candès and Plan [9] provide a coarser bound than (3.11), which does not even guarantee consistency as $n \to \infty$, whatever $M$ and $\mathrm{rank}(A_0)$ are (see [20] for more detailed comments on [9]).

4. Lower Bounds. In this section, we prove the minimax lower bounds showing that the rates attained by our estimator are optimal up to logarithmic factors. The argument here is close to [24] where the lower bounds are obtained on the Schatten balls. However, we consider different classes that consist of matrices with uniformly (in $m_1, m_2, n$) bounded entries. We cannot apply directly the lower bounds of Theorem 6 in [24] for USR matrix completion on the Schatten balls because they are achieved on matrices with entries that are not uniformly bounded for $m_1 m_2 \gg n$.

We will need the following assumption, which is similar in spirit but, in general, substantially weaker than the usual Restricted Isometry condition.

Assumption 2. (Restricted Isometry in Expectation.) For some $1 \le r \le \min(m_1, m_2)$ and some $0 < \mu < \infty$ there exists a constant $\delta_r \in [0, 1)$ such that

$$(1 - \delta_r)\|A\|_2 \le \mu\|A\|_{L_2(\Pi)} \le (1 + \delta_r)\|A\|_2$$

for all matrices $A \in \mathbb{R}^{m_1\times m_2}$ with rank at most $r$.






For the particular case of fixed $X_i$ (cf. Example 4 in the Introduction), Assumption 2 coincides with the matrix version of scaled restricted isometry with scaling factor $\mu$ [24].

Remark 1. Inspection of the proof of Theorem 5 shows that it remains valid if we replace $1 - \delta_r$ and $1 + \delta_r$ by arbitrary positive constants $\nu_1$ and $\nu_2$ such that $\nu_1 < \nu_2$. We use the formulation involving $\delta_r$ only to ease parallels to the usual restricted isometry condition.

We will denote by $\inf_{\hat A}$ the infimum over all estimators $\hat A$ with values in $\mathbb{R}^{m_1\times m_2}$. For any integer $r \le \min(m_1, m_2)$ and any $a > 0$ we consider the class of matrices

$$\mathcal{A}(r, a) = \big\{A_0 \in \mathbb{R}^{m_1\times m_2} : \mathrm{rank}(A_0) \le r,\ \max_{i,j}|a_0(i,j)| \le a\big\}.$$

For any $A \in \mathbb{R}^{m_1\times m_2}$, let $\mathbb{P}_A$ denote the probability distribution of the observations $(X_1, Y_1, \dots, X_n, Y_n)$ with $E(Y_i|X_i) = \langle A, X_i\rangle$. We set for brevity $M = \max(m_1, m_2)$.

Theorem 5. Fix $a > 0$ and an integer $1 \le r \le \min(m_1, m_2)$. Let Assumption 2 be satisfied with some $\mu > 0$. Assume that $\mu^2 r \le n\min(m_1, m_2)$, and that conditionally on $X_i$, the variables $\xi_i$ are Gaussian $N(0, \sigma^2)$, $\sigma^2 > 0$, for $i = 1, \dots, n$. Then there exist absolute constants $\beta \in (0, 1)$ and $c > 0$, such that

$$\inf_{\hat A}\sup_{A_0 \in \mathcal{A}(r,a)} \mathbb{P}_{A_0}\Big(\|\hat A - A_0\|_{L_2(\Pi)}^2 > c(1 - \delta_r)^2(\sigma \wedge a)^2\, \frac{Mr}{n}\Big) \ge \beta. \tag{4.1}$$

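Proof. Without loss of generality, assume that $M = \max(m_1, m_2) = m_1 \ge m_2$. For some constant $0 \le \gamma \le 1$ we define

$$\mathcal{C} = \Big\{\tilde A = (a_{ij}) \in \mathbb{R}^{m_1\times r} : a_{ij} \in \Big\{0,\ \gamma(\sigma \wedge a)\Big(\frac{\mu^2 r}{m_2 n}\Big)^{1/2}\Big\},\ \forall\, 1 \le i \le m_1,\ 1 \le j \le r\Big\}$$

and consider the associated set of block matrices

$$B(\mathcal{C}) = \big\{A = (\tilde A\,|\,\cdots\,|\,\tilde A\,|\,O) \in \mathbb{R}^{m_1\times m_2} : \tilde A \in \mathcal{C}\big\},$$

where $O$ denotes the $m_1 \times (m_2 - r\lfloor m_2/r\rfloor)$ zero matrix, and $\lfloor x\rfloor$ is the integer part of $x$.

By construction, any element of $B(\mathcal{C})$ as well as the difference of any two elements of $B(\mathcal{C})$ has rank at most $r$ and the entries of any matrix in $B(\mathcal{C})$ take values in $[0, a]$. Thus, $B(\mathcal{C}) \subset \mathcal{A}(r, a)$. Due to the Varshamov-Gilbert bound (cf. Lemma 2.9 in [27]), there exists a subset $\mathcal{A}^0 \subset B(\mathcal{C})$ with cardinality $\mathrm{Card}(\mathcal{A}^0) \ge 2^{r m_1/8} + 1$ containing the zero $m_1 \times m_2$ matrix $0$ and such that, for any two distinct elements $A_1$ and $A_2$ of $\mathcal{A}^0$,

$$\|A_1 - A_2\|_2^2 \ge \frac{m_1 r}{8}\Big(\gamma^2(\sigma \wedge a)^2\, \frac{\mu^2 r}{m_2 n}\Big)\Big\lfloor \frac{m_2}{r}\Big\rfloor \ge \frac{\gamma^2}{16}(\sigma \wedge a)^2\, \frac{\mu^2 m_1 r}{n}. \tag{4.2}$$

In view of Assumption 2, this implies

$$\|A_1 - A_2\|_{L_2(\Pi)}^2 \ge (1 - \delta_r)^2\, \frac{\gamma^2}{16}(\sigma \wedge a)^2\, \frac{Mr}{n}. \tag{4.3}$$

Using that, conditionally on $X_i$, the distributions of $\xi_i$ are Gaussian, we get that, for any $A \in \mathcal{A}^0$, the Kullback-Leibler divergence $K(\mathbb{P}_0, \mathbb{P}_A)$ between $\mathbb{P}_0$ and $\mathbb{P}_A$ satisfies

$$K(\mathbb{P}_0, \mathbb{P}_A) = \frac{n}{2\sigma^2}\|A\|_{L_2(\Pi)}^2 \le (1 + \delta_r)^2\, \frac{\gamma^2 m_1 r}{2}. \tag{4.4}$$

From (4.4) we deduce that the condition

$$\frac{1}{\mathrm{Card}(\mathcal{A}^0) - 1}\sum_{A \in \mathcal{A}^0} K(\mathbb{P}_0, \mathbb{P}_A) \le \alpha\log\big(\mathrm{Card}(\mathcal{A}^0) - 1\big) \tag{4.5}$$

is satisfied for any $\alpha > 0$ if $\gamma > 0$ is chosen as a sufficiently small numerical constant depending on $\alpha$. In view of (4.3) and (4.5), the result now follows by application of Theorem 2.5 in [27]. □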



In the USR matrix completion problem we have $\|A\|_{L_2(\Pi)}^2 = (m_1 m_2)^{-1}\|A\|_2^2$ for all matrices $A \in \mathbb{R}^{m_1\times m_2}$. Thus, the corresponding lower bound follows immediately from the previous theorem with $\delta_r = 0$ and $\mu = \sqrt{m_1 m_2}$.

Theorem 6. Fix $a > 0$ and an integer $r$ such that $1 \le r \le \min(m_1, m_2)$, $Mr \le n$. Let the matrices $X_i$ be i.i.d. uniformly distributed on $\mathcal{X}$ and let, conditionally on $X_i$, the variables $\xi_i$ be Gaussian $N(0, \sigma^2)$, $\sigma^2 > 0$, for $i = 1, \dots, n$. Then there exist absolute constants $\beta \in (0, 1)$ and $c > 0$, such that

$$\inf_{\hat A}\sup_{A_0 \in \mathcal{A}(r,a)} \mathbb{P}_{A_0}\Big(\frac{1}{m_1 m_2}\|\hat A - A_0\|_2^2 > c(\sigma \wedge a)^2\, \frac{Mr}{n}\Big) \ge \beta. \tag{4.6}$$

Comparing Theorem 6 with Corollary 2(i) we see that, in the case of Gaussian errors $\xi_i$, the rate of convergence of our estimator $\hat A^\lambda$ given in (3.9) is optimal (up to a logarithmic factor) in a minimax sense on the class of matrices $\mathcal{A}(r, a)$.
A similar conclusion can be obtained for the statistical learning setting. Indeed, assume that the pairs $(X_i, Y_i)$ are i.i.d. realizations of a random pair $(X, Y)$ with distribution $P_{XY}$ belonging to the class

$$\mathcal{P}_{A_0,\eta} = \big\{P_{XY} : X \sim \Pi_0,\ |Y| \le \eta\ \text{(a.s.)},\ E(Y|X) = \langle A_0, X\rangle\big\},$$

where $\Pi_0$ is the uniform distribution on $\mathcal{X}$, $1 \le r \le \min(m_1, m_2)$ is an integer, and $\eta > 0$.

Theorem 7. Let $n, m_1, m_2, r$ be as in Theorem 5. Let $(X_i, Y_i)$ be i.i.d. realizations of a random pair $(X, Y)$ with distribution $P_{XY}$. Then there exist absolute constants $\beta \in (0, 1)$ and $c > 0$, such that

$$\inf_{\hat A}\ \sup_{\mathrm{rank}(A_0) \le r}\ \sup_{P_{XY} \in \mathcal{P}_{A_0,\eta}} \mathbb{P}\Big(\frac{1}{m_1 m_2}\|\hat A - A_0\|_2^2 > c\eta^2\, \frac{Mr}{n}\Big) \ge \beta. \tag{4.7}$$

Proof. We act as in the proof of Theorem 5 with some modifications. Assuming that $M = \max(m_1, m_2) = m_1 \ge m_2$ and $0 < \gamma < 1/2$ we define the class of matrices

$$\mathcal{C}' = \Big\{\tilde A = (a_{ij}) \in \mathbb{R}^{m_1\times r} : a_{ij} \in \Big\{0,\ \gamma\eta\Big(\frac{m_1 r}{n}\Big)^{1/2}\Big\},\ \forall\, 1 \le i \le m_1,\ 1 \le j \le r\Big\}$$

and take its block extension $B(\mathcal{C}')$. Consider the joint distributions $P_{XY}$ such that $X \sim \Pi_0$ and, conditionally on $X$, $Y = \eta$ with probability $p_{A_0}(X) = 1/2 + \langle A_0, X\rangle/(2\eta)$ and $Y = -\eta$ with probability $1 - p_{A_0}(X) = 1/2 - \langle A_0, X\rangle/(2\eta)$, where $A_0 \in B(\mathcal{C}')$. It is easy to see that such distributions $P_{XY}$ belong to the class $\mathcal{P}_{A_0,\eta}$, and our assumptions guarantee that $1/4 \le p_{A_0}(X) \le 3/4$, $\mathrm{rank}(A_0) \le r$ for all $A_0 \in B(\mathcal{C}')$. We will denote the corresponding $n$-product measure by $\mathbb{P}_{A_0}$. For any $A \in B(\mathcal{C}')$, the Kullback-Leibler divergence between $\mathbb{P}_0$ and $\mathbb{P}_A$ has the form

$$K(\mathbb{P}_0, \mathbb{P}_A) = n\,E\Big(p_0(X)\log\frac{p_0(X)}{p_A(X)} + (1 - p_0(X))\log\frac{1 - p_0(X)}{1 - p_A(X)}\Big). \tag{4.8}$$

Using the inequality $-\log(1 + u) \le -u + u^2/2$, $\forall u > -1$, and the fact that $1/4 \le p_A(X) \le 3/4$, we find that the expression under the expectation in (4.8) is bounded by $2(p_0(X) - p_A(X))^2$. This implies

$$K(\mathbb{P}_0, \mathbb{P}_A) \le \frac{n}{2\eta^2}\|A\|_{L_2(\Pi_0)}^2.$$

The remaining arguments are analogous to those in the proof of Theorem 5. □
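5. Further results and examples.

5.1. Recovery of the rank and specific lower bound. A notable property of the estimator $\hat A^\lambda$ in the matrix completion setting is that it has the same rank as the underlying matrix $A_0$ with probability close to 1. As a consequence we can establish a lower bound for the Frobenius error of $\hat A^\lambda$ with rates matching, up to constants, the upper bounds of Corollary 2.

Theorem 8. Let $X_i$ be i.i.d. uniformly distributed on $\mathcal{X}$ and let $\lambda$ satisfy the inequality $\lambda \ge 2\|M\|_\infty$ (as in Theorem 1). Consider the estimator $\hat A^{\lambda'}$ with $\lambda' = \lambda/(1 - \delta)$ for some $0 < \delta < 1$. Set $\hat r = \mathrm{rank}(\hat A^{\lambda'})$. Then

$$\hat r \le \mathrm{rank}(A_0). \tag{5.1}$$

If, in addition, $\min_{j:\,\sigma_j(A_0) \ne 0} \sigma_j(A_0) > \lambda' m_1 m_2$, then

$$\hat r \ge \mathrm{rank}(A_0), \tag{5.2}$$

and

$$\|\hat A^{\lambda'} - A_0\|_2^2 \ge \frac{\delta^2}{4}\,\mathrm{rank}(A_0)\,(\lambda' m_1 m_2)^2. \tag{5.3}$$

Proof. Note that $X - A_0 = m_1 m_2 M$. Using a standard matrix perturbation argument (cf. [25], page 203), we get, for all $j = 1, \dots, m_1 \wedge m_2$,

$$|\sigma_j(X) - \sigma_j(A_0)| \le \sigma_1(X - A_0) = m_1 m_2\|M\|_\infty \le \frac{\lambda m_1 m_2}{2} = (1 - \delta)\,\frac{\lambda' m_1 m_2}{2}.$$

Since, by (3.2), $\sigma_{\hat r}(X) > \lambda' m_1 m_2/2$, we find that $\sigma_{\hat r}(A_0) > \delta\lambda' m_1 m_2/2 > 0$. This implies (5.1). Now, if $\sigma_j(A_0) > \lambda' m_1 m_2$ we get

$$\sigma_j(X) \ge \sigma_j(A_0) - |\sigma_j(X) - \sigma_j(A_0)| > \lambda' m_1 m_2 - (1 - \delta)\,\frac{\lambda' m_1 m_2}{2} > \frac{\lambda' m_1 m_2}{2},$$

and thus (5.2) follows.

To prove (5.3), denote by $\mathcal{P} : \mathbb{R}^{m_1\times m_2} \to \mathbb{R}^{m_1\times m_2}$ the projector on the linear span of the matrices $(u_j(X) v_j(X)^T,\ j = 1, \dots, r)$, where $r = \mathrm{rank}(A_0)$. We have $\|\hat A^{\lambda'} - A_0\|_2 \ge \|\mathcal{P}(\hat A^{\lambda'} - A_0)\|_2 \ge \|\mathcal{P}(\hat A^{\lambda'} - X)\|_2 - \|\mathcal{P}(X - A_0)\|_2$. Here $\|\mathcal{P}(\hat A^{\lambda'} - X)\|_2 = \sqrt{r}\,\lambda' m_1 m_2/2$ in view of (3.2) and the fact that $\hat r = r$, cf. (5.1) and (5.2). On the other hand, $\|\mathcal{P}(X - A_0)\|_2 \le \sqrt{r}\,\|M\|_\infty m_1 m_2 \le \sqrt{r}\,\lambda m_1 m_2/2$. This implies

$$\|\hat A^{\lambda'} - A_0\|_2 \ge \sqrt{r}\Big(\frac{\lambda' m_1 m_2}{2} - (1 - \delta)\,\frac{\lambda' m_1 m_2}{2}\Big) = \delta\sqrt{r}\,\frac{\lambda' m_1 m_2}{2}.$$

□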
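The rank-recovery statements (5.1) and (5.2) are deterministic consequences of the perturbation bound once $\lambda \ge 2\|M\|_\infty$, so they can be illustrated on synthetic data. The following is a minimal sketch, assuming NumPy; the function names and the synthetic setup are hypothetical, and the noise matrix `E` plays the role of $m_1 m_2 M = X - A_0$.

```python
import numpy as np

def estimated_rank(X, lam_prime, m1, m2):
    """Rank of the soft-thresholded estimator: number of singular values
    of X strictly above the threshold lam_prime * m1 * m2 / 2."""
    s = np.linalg.svd(X, compute_uv=False)
    return int(np.sum(s > lam_prime * m1 * m2 / 2.0))

def demo(seed=0, m1=6, m2=5, delta=0.5):
    rng = np.random.default_rng(seed)
    # Rank-2 matrix A0 with singular values 100 and 50.
    U, _ = np.linalg.qr(rng.standard_normal((m1, 2)))
    V, _ = np.linalg.qr(rng.standard_normal((m2, 2)))
    A0 = U @ np.diag([100.0, 50.0]) @ V.T
    E = 0.01 * rng.standard_normal((m1, m2))   # stands in for m1*m2*M
    X = A0 + E
    norm_M = np.linalg.norm(E, 2) / (m1 * m2)  # operator norm of M
    lam = 2.0 * norm_M                         # smallest lambda allowed by Theorem 8
    lam_prime = lam / (1.0 - delta)
    return estimated_rank(X, lam_prime, m1, m2), int(np.linalg.matrix_rank(A0))
```

In this setup the smallest nonzero singular value of $A_0$ is far above $\lambda' m_1 m_2$, so both (5.1) and (5.2) apply and the two returned ranks should coincide.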
Corollary 3. Let the assumptions of Corollary 2 be satisfied. Consider the estimator $\hat A^{\lambda'}$ with

$$\lambda' = \frac{C_* c_*}{1 - \delta}\sqrt{\frac{\log(m)}{(m_1 \wedge m_2)\,n}}$$

for some $0 < \delta < 1$. Set $\hat r = \mathrm{rank}(\hat A^{\lambda'})$. Then $\hat r \le \mathrm{rank}(A_0)$ with probability at least $1 - 3/(m_1 + m_2)$. If, in addition,

$$\min_{j:\,\sigma_j(A_0) \ne 0} \sigma_j(A_0) > \frac{C_* c_*}{1 - \delta}\sqrt{m_1 m_2}\,\sqrt{\frac{\log(m)(m_1 \vee m_2)}{n}}, \tag{5.4}$$

then $\hat r \ge \mathrm{rank}(A_0)$ and

$$\frac{1}{m_1 m_2}\|\hat A^{\lambda'} - A_0\|_2^2 \ge \frac{\delta^2 C_*^2 c_*^2}{4(1 - \delta)^2}\, \frac{\log(m)(m_1 \vee m_2)}{n}\,\mathrm{rank}(A_0) \tag{5.5}$$

with the same probability.

We note that the lower bound for $\sigma_j(A_0)$ in (5.4) is not excessively high, since $\sqrt{m_1 m_2}$ is a "typical" order of the largest singular value $\sigma_1(A_0)$ for non-lacunary matrices $A_0$. For example, if all the entries of $A_0$ are equal to some constant $a$, the left hand side of (5.4) is equal to $\sigma_1(A_0) = a\sqrt{m_1 m_2}$.

5.2. Risk bounds in statistical learning. The results of the previous sections can also be extended to the traditional statistical learning setting where $(X_i, Y_i)$ is a sequence of i.i.d. replications of a random pair $(X, Y)$ with $X \in \mathbb{R}^{m_1\times m_2}$ and $Y \in \mathbb{R}$, and there is no underlying model determined by a matrix $A_0$, i.e., we do not assume that $E(Y|X) = \langle A_0, X\rangle$. Then the above oracle inequalities can be reformulated in terms of the prediction risk

$$R(A) = E\big[(Y - \langle A, X\rangle)^2\big], \qquad \forall A \in \mathbb{R}^{m_1\times m_2}.$$

We illustrate this by an example dealing with USR matrix completion. Specifically, Theorem 4 is reformulated in the following way.

Theorem 9. Let $X_i$ be i.i.d. uniformly distributed on $\mathcal{X}$. Assume that $|Y| \le \eta$ almost surely for some constant $\eta$. For $t > 0$ consider the regularization parameter $\lambda$ satisfying (3.6). Then with probability at least $1 - e^{-t}$ we have

$$R(\hat A^\lambda) \le R(A) + \min\Big\{2\lambda\|A\|_1,\ \Big(\frac{1 + \sqrt{2}}{2}\Big)^2 m_1 m_2\lambda^2\,\mathrm{rank}(A)\Big\} \tag{5.6}$$

for all $A \in \mathbb{R}^{m_1\times m_2}$. In particular, under the assumptions of Corollary 2(ii),

$$R(\hat A^\lambda) \le \min_{A \in \mathbb{R}^{m_1\times m_2}}\Big(R(A) + \Big(\frac{1 + \sqrt{2}}{2}\Big)^2 C_*^2\eta^2\log(m)\, \frac{M\,\mathrm{rank}(A)}{n}\Big). \tag{5.7}$$

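This theorem can also be viewed as a result about approximate sparsity. We do not know whether the true underlying model is described by some matrix $A_0$, but we can guarantee that our estimator is not far from the best approximation provided by matrices $A$ with small rank or small nuclear norm.

Note that the results of Theorem 9 are uniform over the class of distributions

$$\mathcal{P}_\eta = \big\{P_{XY} : X \sim \Pi_0,\ |Y| \le \eta\ \text{(a.s.)}\big\},$$

where $\Pi_0$ is the uniform distribution on $\mathcal{X}$, and $\eta > 0$ is a constant. The corresponding lower bound is given in the next theorem.

Theorem 10. Let $n, m_1, m_2, r$ be as in Theorem 5. Let $(X_i, Y_i)$ be i.i.d. realizations of a random pair $(X, Y)$ with distribution $P_{XY}$. Then

$$\inf_{\hat A}\ \sup_{\mathrm{rank}(A) \le r}\ \sup_{P_{XY} \in \mathcal{P}_\eta} \mathbb{P}\Big(R(\hat A) > R(A) + c\eta^2\, \frac{Mr}{n}\Big) \ge \beta, \tag{5.8}$$

where $\beta \in (0, 1)$ and $c > 0$ are absolute constants.

Proof. For $E(Y|X) = \langle A_0, X\rangle$ we have $R(A) = \|A - A_0\|_{L_2(\Pi)}^2 + \sigma^2 = (m_1 m_2)^{-1}\|A - A_0\|_2^2 + \sigma^2$, where $\sigma^2 = E[(Y - E(Y|X))^2]$. Thus, using Theorem 7 we get

$$\inf_{\hat A}\ \sup_{\mathrm{rank}(A) \le r}\ \sup_{P_{XY} \in \mathcal{P}_\eta} \mathbb{P}\Big(R(\hat A) > R(A) + c\eta^2\, \frac{Mr}{n}\Big) \ge \inf_{\hat A}\ \sup_{\mathrm{rank}(A) \le r}\ \sup_{P_{XY} \in \mathcal{P}_{A,\eta}} \mathbb{P}\Big(\frac{1}{m_1 m_2}\|\hat A - A\|_2^2 > c\eta^2\, \frac{Mr}{n}\Big) \ge \beta.$$

□

Inequalities (5.7) and (5.8) imply minimax rate optimality of $\hat A^\lambda$ up to a logarithmic factor in the statistical learning setting.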
5.3. Risk bounds in spectral norm. The results of the previous sections on the Frobenius norm can be extended to the spectral norm. In this subsection we consider the USR matrix completion problem, i.e., we assume that the matrices $X_i$ are i.i.d. uniformly distributed on $\mathcal{X}$, which implies that $\|A\|_{L_2(\Pi)}^2 = (m_1 m_2)^{-1}\|A\|_2^2$ for all matrices $A \in \mathbb{R}^{m_1\times m_2}$.

Theorem 11. Let $X_i$ be i.i.d. uniformly distributed on $\mathcal{X}$. Consider the estimator $\hat A^\lambda$ defined in (3.1). If $\lambda \ge \|M\|_\infty$, then

$$\|\hat A^\lambda - A_0\|_\infty \le \frac{3}{2}\, m_1 m_2\lambda.$$

Proof. We have

$$\|\hat A^\lambda - A_0\|_\infty \le \|\hat A^\lambda - X\|_\infty + m_1 m_2\|M\|_\infty,$$

where we recall that $X = \frac{m_1 m_2}{n}\sum_{i=1}^n Y_i X_i$, $E(X) = A_0$ and $M$ is defined in (2.2). In view of (3.2), we clearly have $\|\hat A^\lambda - X\|_\infty \le \lambda m_1 m_2/2$. The result follows immediately since $\|M\|_\infty \le \lambda$. □

As a consequence of the above theorem, we can derive the optimal rate (up to a logarithmic factor) of USR matrix completion for the spectral norm when the noise is sub-exponential or in the statistical learning setting.

Theorem 12. Let one of the sets of conditions (i) or (ii) in Corollary 2 be satisfied. Then, with probability at least $1 - 3/(m_1 + m_2)$, we have

$$\|\hat A^\lambda - A_0\|_\infty \le C C_* c_* \sqrt{m_1 m_2}\,\sqrt{\frac{(m_1 \vee m_2)\log m}{n}},$$

where $C > 0$ is an absolute constant.

Proof. The proof of this result is immediate by combining Theorem 11 and Lemmas 1, 2 and 3. □

Theorem 13. (i) Let the conditions of Theorem 6 be satisfied. Then

$$\inf_{\hat A}\sup_{A_0 \in \mathcal{A}(r,a)} \mathbb{P}_{A_0}\Big(\|\hat A - A_0\|_\infty > c(\sigma \wedge a)\sqrt{m_1 m_2}\,\sqrt{\frac{m_1 \vee m_2}{n}}\Big) \ge \beta, \tag{5.9}$$

where $\beta \in (0, 1)$ and $c > 0$ are absolute constants.

(ii) Let the conditions of Theorem 7 be satisfied. Then

$$\inf_{\hat A}\ \sup_{\mathrm{rank}(A_0) \le r}\ \sup_{P_{XY} \in \mathcal{P}_{A_0,\eta}} \mathbb{P}\Big(\|\hat A - A_0\|_\infty > c\eta\sqrt{m_1 m_2}\,\sqrt{\frac{m_1 \vee m_2}{n}}\Big) \ge \beta, \tag{5.10}$$

where $\beta \in (0, 1)$ and $c > 0$ are absolute constants.

Proof. Note first that, in the USR matrix completion problem, Assumption 2 is satisfied with $\delta_r = 0$ and $\mu = \sqrt{m_1 m_2}$.

We prove part (i) of the theorem. Consider the set of matrices $\mathcal{A}^0$ introduced in the proof of Theorem 5. For any two distinct matrices $A_1, A_2$ of $\mathcal{A}^0$, we have

$$\|A_1 - A_2\|_\infty \ge \frac{\gamma}{4}(\sigma \wedge a)\sqrt{m_1 m_2}\,\sqrt{\frac{m_1 \vee m_2}{n}}. \tag{5.11}$$

Indeed, if (5.11) does not hold, we get

$$\|A_1 - A_2\|_2^2 \le \mathrm{rank}(A_1 - A_2)\,\|A_1 - A_2\|_\infty^2 < \frac{\gamma^2}{16}(\sigma \wedge a)^2 m_1 m_2\, \frac{Mr}{n},$$

since $\mathrm{rank}(A_1 - A_2) \le r$ by construction of $\mathcal{A}^0$. This contradicts (4.3).

Next, (4.5) is satisfied for any $\alpha > 0$ if $\gamma > 0$ is chosen as a sufficiently small numerical constant depending on $\alpha$.

Combining (5.11) with (4.5) and Theorem 2.5 in [27] gives the result.

The proof of (ii) follows the same arguments. □

5.4. Sharp oracle inequalities for the Lasso. As we already mentioned in Example 4 and in the remark after Theorem 2, one can exploit (2.19) to derive sparsity oracle inequalities for the usual Lasso. This is detailed in the present subsection. It is noteworthy that the obtained inequalities are sharp (i.e., with leading constant 1), which was not achieved in the previous work on the Lasso.

Note that, if $m_1 = m_2 = p$ and $A$ and $X_i$ are diagonal matrices, then the trace regression model (1.2) becomes

$$Y_i = x_i^T\beta^* + \xi_i, \qquad i = 1, \dots, n,$$

where $x_i, \beta^* \in \mathbb{R}^p$ denote the vectors of diagonal elements of $X_i$, $A_0$, respectively. Set $\mathbf{X} = (x_1, \dots, x_n)^T \in \mathbb{R}^{n\times p}$ to be the design matrix of this linear regression model. For a vector $z = (z^{(1)}, \dots, z^{(d)}) \in \mathbb{R}^d$ define $|z|_q = \big(\sum_{j=1}^d |z^{(j)}|^q\big)^{1/q}$ for $1 \le q < \infty$ and $|z|_\infty = \max_{1 \le j \le d}|z^{(j)}|$.

Assume in what follows that $x_i$ are fixed. Then for $A = \mathrm{diag}(\beta)$ we have $\|A\|_{L_2(\Pi)}^2 = n^{-1}|\mathbf{X}\beta|_2^2$, where $\mathrm{diag}(\beta)$ denotes the diagonal $p \times p$ matrix with the components of $\beta$ on the diagonal. We will assume without loss of generality that the diagonal elements of the Gram matrix $\frac{1}{n}\mathbf{X}^T\mathbf{X}$ are not larger than 1 (the general case is obtained from this by simple rescaling).

The estimator $\hat A^\lambda$ defined in (1.7) becomes the usual Lasso estimator

$$\hat\beta^\lambda = \mathop{\mathrm{argmin}}_{\beta \in \mathbb{R}^p}\Big\{\frac{1}{n}\sum_{i=1}^n (Y_i - x_i^T\beta)^2 + \lambda|\beta|_1\Big\}.$$

For a vector $\beta \in \mathbb{R}^p$, we set, with a little abuse of notation, $\mu_{c_0}(\beta) = \mu_{c_0}(\mathrm{diag}(\beta))$, $\mu(\beta) = \mu_5(\beta)$. Let $M(\beta)$ denote the number of nonzero components of $\beta$.

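For simplicity, the result is stated only in the case of Gaussian noise.

Theorem 14. Let $\xi_i$ be i.i.d. Gaussian $N(0, \sigma^2)$ and let the diagonal elements of the matrix $\frac{1}{n}\mathbf{X}^T\mathbf{X}$ be not larger than 1. Take

$$\lambda = C\sigma\sqrt{\frac{\log p}{n}},$$

where $C = 3b\sqrt{2}$, $b > 1$. Then, with probability at least $1 - \dfrac{1}{p^{b^2-1}\sqrt{\pi\log p}}$, we have

$$\frac{1}{n}|\mathbf{X}(\hat\beta^\lambda - \beta^*)|_2^2 \le \inf_{\beta \in \mathbb{R}^p}\Big\{\frac{1}{n}|\mathbf{X}(\beta - \beta^*)|_2^2 + C^2\sigma^2\, \frac{\mu^2(\beta)M(\beta)\log p}{n}\Big\}. \tag{5.12}$$

Proof. Combine Theorem 2 and a standard bound on the tail of the Gaussian distribution, which assures that with probability at least $1 - \dfrac{1}{p^{b^2-1}\sqrt{\pi\log p}}$,

$$\|M\|_\infty = \Big|\frac{1}{n}\sum_{i=1}^n \xi_i x_i\Big|_\infty \le b\sigma\sqrt{\frac{2\log p}{n}}.$$

□

Given $\beta \in \mathbb{R}^p$ and $J \subset \{1, \dots, p\}$, denote by $\beta_J$ the vector in $\mathbb{R}^p$ which has the same coordinates as $\beta$ on $J$ and zero coordinates on the complement $J^c$ of $J$.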

We recall the Restricted Eigenvalue condition of [7]:

Condition RE($s$, $c_0$). For some integer $s$ such that $1 \le s \le p$, and a positive number $c_0$ the following condition holds:

$$\kappa(s, c_0) = \min_{\substack{J \subset \{1,\dots,p\},\\ |J| \le s}}\ \min_{\substack{u \in \mathbb{R}^p,\ u \ne 0,\\ |u_{J^c}|_1 \le c_0|u_J|_1}} \frac{|\mathbf{X}u|_2}{\sqrt{n}\,|u_J|_2} > 0.$$

We have the following corollary. 

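Corollary 4. Let the assumptions of Theorem 14 hold, and let condition
RE$(s, 5)$ be satisfied for some $1 \le s \le p$. Then, with probability at least
$1 - \frac{1}{p^{b^2-1}\sqrt{\pi\log p}}$,

$$\frac{1}{n}\big|X(\hat\beta^\lambda - \beta^*)\big|_2^2 \le \inf_{\beta\in\mathbb{R}^p:\ M(\beta)\le s}\Big\{\frac{1}{n}\big|X(\beta - \beta^*)\big|_2^2 + \frac{C^2\sigma^2}{\kappa^2(s,5)}\,\frac{M(\beta)\log p}{n}\Big\}. \tag{5.13}$$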

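Proof. Recall that $e_j(p)$ denote the canonical basis vectors of $\mathbb{R}^p$. For
any $p \times p$ diagonal matrix $A = \mathrm{diag}(\beta)$ with support $(S_1, S_2)$, $S_1 = S_2 =
\{e_j(p), j \in J\}$, where $J \subseteq \{1,\dots,p\}$ has cardinality $|J| \le s$, and an arbitrary
$p \times p$ diagonal matrix $B = \mathrm{diag}(u)$, where $u \in \mathbb{R}^p$, we have

$$\|\mathcal{P}_A(B)\|_1 = |u_J|_1, \qquad \|\mathcal{P}_A(B)\|_2 = |u_J|_2, \qquad \|\mathcal{P}_A^\perp(B)\|_1 = |u_{J^c}|_1$$

and

$$\mathcal{C}_{A,c_0} = \big\{\mathrm{diag}(u) : u \in \mathbb{R}^p,\ |u_{J^c}|_1 \le c_0|u_J|_1\big\}, \qquad \|B\|_{L_2(\Pi)} = \frac{1}{\sqrt{n}}|Xu|_2.$$

Thus,

$$\frac{1}{\mu_{c_0}(A)} = \min_{B\in\mathcal{C}_{A,c_0},\ B\ne 0}\frac{\|B\|_{L_2(\Pi)}}{\|\mathcal{P}_A(B)\|_2} = \min_{u\ne 0:\ |u_{J^c}|_1\le c_0|u_J|_1}\frac{|Xu|_2}{\sqrt{n}\,|u_J|_2} \ge \kappa(s,c_0).$$

Since Condition RE$(s, 5)$ is satisfied, Theorem 14 yields the result. □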

Remark 2. Oracle inequalities (5.12) and (5.13) extend straightforwardly
to the model

$$Y_i = f_i + \xi_i, \quad i = 1,\dots,n, \tag{5.14}$$

where $f_i$ are arbitrary fixed values and not necessarily $f_i = x_i^T\beta^*$. This
setting is interesting in the context of aggregation. Then $x_1,\dots,x_n$ are vectors
of values of some given dictionary of $p$ functions at $n$ given points and $f_i$ are
the values of an unknown regression function at the same points. Under the
model (5.14), inequalities (5.12) and (5.13) hold true with the only difference
that $X\beta^*$ should be replaced by the vector $f = (f_1,\dots,f_n)^T$. With such a
modification, (5.13) improves upon Theorem 6.1 of [7] where the leading
constant is greater than 1.

6. Control of the stochastic error. In this section, we obtain the
probability inequalities for the stochastic error $\|M\|_\infty$. For brevity, we will
write throughout $\|\cdot\|_\infty = \|\cdot\|$. The following proposition is an immediate
consequence of the matrix version of Bernstein's inequality (Corollary 9.1 in
[26]).

Proposition 1. Let $Z_1,\dots,Z_n$ be independent random matrices with
dimensions $m_1 \times m_2$ that satisfy $E(Z_i) = 0$ and $\|Z_i\| \le U$ almost surely for
some constant $U$ and all $i = 1,\dots,n$. Define

$$\sigma_Z = \max\Big\{\Big\|\frac{1}{n}\sum_{i=1}^n E(Z_iZ_i^T)\Big\|^{1/2},\ \Big\|\frac{1}{n}\sum_{i=1}^n E(Z_i^TZ_i)\Big\|^{1/2}\Big\}.$$

Then, for all $t > 0$, with probability at least $1 - e^{-t}$ we have

$$\Big\|\frac{Z_1 + \cdots + Z_n}{n}\Big\| \le 2\max\Big\{\sigma_Z\sqrt{\frac{t + \log(m)}{n}},\ U\,\frac{t + \log(m)}{n}\Big\},$$

where $m = m_1 + m_2$.
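To get a feel for the two regimes in the Bernstein-type bound of Proposition 1, namely $2\max\{\sigma_Z\sqrt{(t+\log m)/n},\ U(t+\log m)/n\}$ with $m = m_1 + m_2$, the following sketch evaluates this right-hand side and the sample size from which the variance term dominates. The function names `bernstein_rhs` and `crossover_n` are hypothetical helpers introduced only for illustration.

```python
import math


def bernstein_rhs(sigma_z, U, n, m1, m2, t):
    """2 * max(sigma_z * sqrt(L / n), U * L / n) with L = t + log(m1 + m2)."""
    level = (t + math.log(m1 + m2)) / n
    return 2.0 * max(sigma_z * math.sqrt(level), U * level)


def crossover_n(sigma_z, U, m1, m2, t):
    """Smallest real n with sigma_z * sqrt(L / n) >= U * L / n, i.e. n = U^2 * L / sigma_z^2."""
    L = t + math.log(m1 + m2)
    return (U ** 2) * L / (sigma_z ** 2)


# Variance-dominated regime (small U) versus range-dominated regime (large U).
rhs_small_U = bernstein_rhs(1.0, 1.0, 100, 5, 5, 1.0)
rhs_large_U = bernstein_rhs(1.0, 100.0, 100, 5, 5, 1.0)
```

For sample sizes above `crossover_n` the bound decays like $n^{-1/2}$; below it, the slower-to-kick-in term $U(t+\log m)/n$ is the larger one.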



Furthermore, it is possible to replace the $L_\infty$-bound $U$ on $\|Z\|$ in the
above inequality by bounds on the weaker $\psi_\alpha$-norms of $\|Z\|$ defined by

$$U_Z^{(\alpha)} = \inf\big\{u > 0 : E\exp(\|Z\|^\alpha/u^\alpha) \le 2\big\}, \qquad \alpha \ge 1.$$



Proposition 2. Let $Z, Z_1,\dots,Z_n$ be i.i.d. random matrices with
dimensions $m_1 \times m_2$ that satisfy $E(Z) = 0$. Suppose that $U_Z^{(\alpha)} < \infty$ for some
$\alpha \ge 1$. Then there exists a constant $C > 0$ such that, for all $t > 0$, with
probability at least $1 - e^{-t}$

$$\Big\|\frac{Z_1 + \cdots + Z_n}{n}\Big\| \le C\max\Big\{\sigma_Z\sqrt{\frac{t + \log(m)}{n}},\ U_Z^{(\alpha)}\Big(\log\frac{U_Z^{(\alpha)}}{\sigma_Z}\Big)^{1/\alpha}\frac{t + \log(m)}{n}\Big\},$$

where $m = m_1 + m_2$.
This is an easy consequence of Proposition 2 in [18], which provides an
analogous result for Hermitian matrices $Z$. Its extension to rectangular
matrices stated in Proposition 2 is straightforward via the self-adjoint dilation,
cf., for example, the proof of Corollary 9.1 in [26].
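For readers unfamiliar with the device, the self-adjoint dilation can be summarised as follows. This is a standard construction written in our own notation $\mathcal{H}(\cdot)$, not a formula taken from the text; it indicates why the dimension factor is $m = m_1 + m_2$ and why the variance parameter is the maximum of the two second-moment norms, $\sigma_Z^2 = \max\{\|\frac{1}{n}\sum_i E(Z_iZ_i^T)\|, \|\frac{1}{n}\sum_i E(Z_i^TZ_i)\|\}$.

```latex
% Dilation of a rectangular m_1 x m_2 matrix Z to a symmetric matrix of size m_1 + m_2.
\[
\mathcal{H}(Z) = \begin{pmatrix} 0 & Z \\ Z^T & 0 \end{pmatrix}
\in \mathbb{R}^{(m_1+m_2)\times(m_1+m_2)},
\qquad
\mathcal{H}(Z)^2 = \begin{pmatrix} ZZ^T & 0 \\ 0 & Z^TZ \end{pmatrix}.
\]
% The map is linear, preserves the operator norm, and its square is block diagonal.
\[
\|\mathcal{H}(Z)\| = \|Z\|,
\qquad
\Big\| \frac{1}{n}\sum_{i=1}^n E\,\mathcal{H}(Z_i)^2 \Big\|
= \max\Big\{ \Big\|\frac{1}{n}\sum_{i=1}^n E(Z_iZ_i^T)\Big\|,\
             \Big\|\frac{1}{n}\sum_{i=1}^n E(Z_i^TZ_i)\Big\| \Big\}
= \sigma_Z^2 .
\]
```

Applying a concentration inequality for symmetric matrices to $\mathcal{H}(Z_1),\dots,\mathcal{H}(Z_n)$ therefore yields a bound for the rectangular $Z_i$ with the same norm and variance parameters.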

The next lemma gives a control of the stochastic error for USR matrix 
completion in the statistical learning setting. 

Lemma 1. Let $X_i$ be i.i.d. uniformly distributed on $\mathcal{X}$. Assume that
$\max_{i=1,\dots,n}|Y_i| \le \eta$ almost surely for some constant $\eta$. Then for any $t > 0$
with probability at least $1 - e^{-t}$ we have

$$\|M\| \le 2\eta\max\Big\{\sqrt{\frac{t + \log(m)}{(m_1\wedge m_2)n}},\ \frac{2(t + \log(m))}{n}\Big\}. \tag{6.1}$$



Proof. We apply Proposition 1 with $Z_i = Y_iX_i - E(Y_iX_i)$. Recall that
here $X_i$ are i.i.d. with the same distribution as $X$ and $Y_i$ are not necessarily
i.i.d. Observe that

$$\|X_i\| = 1, \qquad \sigma_X^2 = \max\big\{\|E(XX^T)\|,\ \|E(X^TX)\|\big\} = \frac{1}{m_1\wedge m_2}. \tag{6.2}$$

Therefore, $\|Z_i\| \le 2\eta$, $\sigma_Z \le \eta\sigma_X$, and the result follows from Proposition 1.
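The value $1/(m_1 \wedge m_2)$ for the variance parameter of the sampling matrices can be checked by brute-force enumeration. The sketch below assumes, as in the USR matrix completion setting, that $X$ is uniform over the $m_1 m_2$ matrices with a single entry equal to 1 and zeros elsewhere. It computes $E(XX^T)$ and $E(X^TX)$ exactly with rational arithmetic. The function names are hypothetical.

```python
from fractions import Fraction


def basis_matrix(j, k, m1, m2):
    """m1 x m2 matrix with a single 1 in position (j, k)."""
    return [[1 if (r == j and c == k) else 0 for c in range(m2)] for r in range(m1)]


def transpose(A):
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def expected_gram(m1, m2, left=True):
    """E(X X^T) if left is True, else E(X^T X), for X uniform over basis matrices."""
    size = m1 if left else m2
    weight = Fraction(1, m1 * m2)
    total = [[Fraction(0)] * size for _ in range(size)]
    for j in range(m1):
        for k in range(m2):
            X = basis_matrix(j, k, m1, m2)
            G = matmul(X, transpose(X)) if left else matmul(transpose(X), X)
            for r in range(size):
                for c in range(size):
                    total[r][c] += weight * G[r][c]
    return total


def sigma_x_squared(m1, m2):
    """max(||E(X X^T)||, ||E(X^T X)||); both matrices turn out to be diagonal."""
    norms = []
    for left in (True, False):
        G = expected_gram(m1, m2, left)
        size = len(G)
        # For a diagonal matrix the operator norm is the largest |diagonal entry|.
        assert all(G[r][c] == 0 for r in range(size) for c in range(size) if r != c)
        norms.append(max(G[r][r] for r in range(size)))
    return max(norms)
```

For example, `sigma_x_squared(3, 5)` returns the exact fraction `1/3`: $E(XX^T)$ is $\frac{1}{3}$ times the $3\times 3$ identity and $E(X^TX)$ is $\frac{1}{5}$ times the $5\times 5$ identity.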



We now consider the USR matrix completion with sub-exponential errors.
Recall that in this case we assume that the pairs $(X_i, Y_i)$ are i.i.d. We have

$$\|M\| \le \Big\|\frac{1}{n}\sum_{i=1}^n \xi_iX_i\Big\| + \Big\|\frac{1}{n}\sum_{i=1}^n \big[\mathrm{tr}(A_0^TX_i)X_i - E(\mathrm{tr}(A_0^TX)X)\big]\Big\| = \Delta_1 + \Delta_2.$$

We treat the terms $\Delta_1$ and $\Delta_2$ separately in the two lemmas below.

Lemma 2. Let $X_i$ be i.i.d. uniformly distributed on $\mathcal{X}$, and the pairs
$(X_i, Y_i)$ be i.i.d. Assume that condition (3.3) holds. Then there exists an
absolute constant $C > 0$ that can depend only on $\alpha, c_1, c$ and such that, for
all $t > 0$, with probability at least $1 - 2e^{-t}$ we have

$$\Delta_1 \le C\sigma\max\Big\{\sqrt{\frac{t + \log(m)}{(m_1\wedge m_2)n}},\ \frac{(t + \log(m))\log^{1/\alpha}(m_1\wedge m_2)}{n}\Big\}. \tag{6.3}$$
Proof. Observe first that for $\bar X = X - E(X)$ we have



(6.4) 
Now, 



4 



1 



mi A 7712 



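$$\Big\|\frac{1}{n}\sum_{i=1}^n \xi_iX_i\Big\| \le \Big\|\frac{1}{n}\sum_{i=1}^n \xi_i(X_i - EX)\Big\| + \Big\|\frac{1}{n}\sum_{i=1}^n \xi_iE(X)\Big\| \le \Big\|\frac{1}{n}\sum_{i=1}^n \xi_i(X_i - EX)\Big\| + \frac{1}{\sqrt{m_1m_2}}\Big|\frac{1}{n}\sum_{i=1}^n \xi_i\Big|. \tag{6.5}$$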

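Set $Z_i = \xi_i(X_i - EX)$. These are i.i.d. random matrices having the same
distribution as a random matrix $Z$. It follows from (6.2) that $\|Z_i\| \le 2|\xi_i|$,
and thus condition (3.3) implies that $U_Z^{(\alpha)} \le c\sigma$ for some constant $c > 0$.
Furthermore, in view of (6.4), we have $\sigma_Z \le c'\sigma\sigma_{\bar X} = c'\sigma/(m_1\wedge m_2)^{1/2}$ for
some constant $c' > 0$ and $\sigma_Z \ge c_1^{1/2}\sigma/(2(m_1\wedge m_2))^{1/2}$. Using these remarks
we can deduce from Proposition 2 that there exists an absolute constant
$C > 0$ such that for any $t > 0$ with probability at least $1 - e^{-t}$ we have

$$\Big\|\frac{1}{n}\sum_{i=1}^n Z_i\Big\| \le C\max\Big\{\sigma_Z\sqrt{\frac{t + \log(m)}{n}},\ U_Z^{(\alpha)}\Big(\log\frac{U_Z^{(\alpha)}}{\sigma_Z}\Big)^{1/\alpha}\frac{t + \log(m)}{n}\Big\} \le C\sigma\max\Big\{\sqrt{\frac{t + \log(m)}{(m_1\wedge m_2)n}},\ \frac{(t + \log(m))\log^{1/\alpha}(m_1\wedge m_2)}{n}\Big\}.$$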



Finally, in view of Condition (3.3) and Bernstein's inequality for sub-exponential
noise, we have for any $t > 0$, with probability at least $1 - e^{-t}$,

$$\Big|\frac{1}{n}\sum_{i=1}^n \xi_i\Big| \le C\sigma\max\Big\{\sqrt{\frac{t}{n}},\ \frac{t}{n}\Big\},$$

where $C > 0$ depends only on $c$. We complete the proof by using the union
bound. □
Define now

$$|A_0|_* = \max\Bigg\{\max_{1\le i\le m_1}\sqrt{\sum_{j=1}^{m_2} a_0^2(i,j)},\ \max_{1\le j\le m_2}\sqrt{\sum_{i=1}^{m_1} a_0^2(i,j)}\Bigg\} \le \max_{i,j}|a_0(i,j)|\sqrt{m_1\vee m_2}.$$
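The quantity just defined is the largest Euclidean norm among the rows and columns of $A_0$. As a quick illustration, the sketch below computes it for a matrix given as a list of rows, together with the entrywise upper bound $\max_{i,j}|a_0(i,j)|\sqrt{m_1 \vee m_2}$. The function names `max_row_col_norm` and `entrywise_bound` are hypothetical.

```python
import math


def max_row_col_norm(A):
    """Largest Euclidean norm over all rows and all columns of A."""
    m1, m2 = len(A), len(A[0])
    row_max = max(math.sqrt(sum(a * a for a in row)) for row in A)
    col_max = max(math.sqrt(sum(A[i][j] ** 2 for i in range(m1))) for j in range(m2))
    return max(row_max, col_max)


def entrywise_bound(A):
    """max_{i,j} |a(i,j)| * sqrt(max(m1, m2)), an upper bound on max_row_col_norm(A)."""
    m1, m2 = len(A), len(A[0])
    a_max = max(abs(a) for row in A for a in row)
    return a_max * math.sqrt(max(m1, m2))


A_demo = [[3.0, 4.0], [0.0, 0.0]]
norm_demo = max_row_col_norm(A_demo)    # first row has norm 5
bound_demo = entrywise_bound(A_demo)    # 4 * sqrt(2)
```

The bound is attained when a longest row or column has all entries equal in absolute value to the largest entry, and it can be loose when the large entries of $A_0$ are few.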



Lemma 3. Let $X_i$ be i.i.d. random variables uniformly distributed in $\mathcal{X}$.
Then, for all $t > 0$, with probability at least $1 - e^{-t}$ we have

$$\Delta_2 \le 2\max\Big\{|A_0|_*\sqrt{\frac{t + \log(m)}{m_1m_2n}},\ 2\max_{i,j}|a_0(i,j)|\,\frac{t + \log(m)}{n}\Big\}. \tag{6.6}$$

If $\max_{i,j}|a_0(i,j)| \le a$ for some $a > 0$, then with the same probability

$$\Delta_2 \le 2a\max\Big\{\sqrt{\frac{t + \log(m)}{(m_1\wedge m_2)n}},\ \frac{2(t + \log(m))}{n}\Big\}.$$



Proof. We apply Proposition 1 for the random variables $Z_i = \mathrm{tr}(A_0^TX_i)X_i -
E(\mathrm{tr}(A_0^TX)X)$. Using (6.2) we get $\|Z_i\| \le 2\max_{i,j}|a_0(i,j)|$ and

$$\sigma_Z^2 \le \max\big\{\|E(\langle A_0, X\rangle^2XX^T)\|,\ \|E(\langle A_0, X\rangle^2X^TX)\|\big\} \le \frac{|A_0|_*^2}{m_1m_2}.$$

Thus, (6.6) follows from Proposition 1.
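The variance bound used in this proof can be verified directly. The following sketch assumes, as in the USR setting, that $X$ is uniform over the matrices $e_j(m_1)e_k(m_2)^T$, so that $\langle A_0, X\rangle = \mathrm{tr}(A_0^TX) = a_0(j,k)$ and $XX^T = e_j(m_1)e_j(m_1)^T$ when $X = e_j(m_1)e_k(m_2)^T$. The symbol $W$ is our own shorthand.

```latex
% Second moment of W = <A_0, X> X under uniform sampling of one entry.
\[
E\big(\langle A_0, X\rangle^2 XX^T\big)
= \frac{1}{m_1m_2}\sum_{j=1}^{m_1}\Big(\sum_{k=1}^{m_2} a_0^2(j,k)\Big)\, e_j(m_1)e_j(m_1)^T ,
\qquad
\big\|E\big(\langle A_0, X\rangle^2 XX^T\big)\big\|
= \frac{1}{m_1m_2}\max_{1\le j\le m_1}\sum_{k=1}^{m_2} a_0^2(j,k).
\]
% The symmetric computation for X^T X gives the maximum over columns, so the larger
% of the two norms is the squared maximal row/column norm of A_0 divided by m_1 m_2.
% Centering can only decrease the second moment:
\[
E(Z_iZ_i^T) = E(WW^T) - E(W)E(W)^T \preceq E(WW^T),
\qquad W = \langle A_0, X\rangle X ,
\]
% and similarly for Z_i^T Z_i, which gives the stated bound on sigma_Z^2.
```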